feat(syslog source): add lossy option for UTF-8 handling #24657

Open
mezgerj wants to merge 6 commits into vectordotdev:master from mezgerj:syslog-lossy

@mezgerj mezgerj commented Feb 15, 2026

Summary

 Adds a `lossy` configuration option to the syslog source so that messages containing invalid UTF-8 byte sequences are no longer
 dropped. When enabled, invalid bytes are replaced with U+FFFD (the Unicode replacement character) and the rest of the message is kept.
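To make the strict/lossy distinction concrete, here is a minimal sketch using only the Rust standard library. The helper names (`decode_strict`, `decode_lossy`, `replacement_count`) are illustrative assumptions and do not come from the PR's diff; the sketch only shows the general behaviour the option toggles.

```rust
use std::borrow::Cow;

/// Strict decoding: any invalid UTF-8 byte rejects the whole message.
/// (Hypothetical helper, not part of Vector.)
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: invalid sequences become U+FFFD, valid text is kept.
/// (Hypothetical helper, not part of Vector.)
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Counts the replacement characters in an already decoded message.
fn replacement_count(text: &str) -> usize {
    text.chars()
        .filter(|&c| c == char::REPLACEMENT_CHARACTER)
        .count()
}
```

With these helpers, `decode_strict(b"Hello \xFF\xFE World")` returns an error, while `decode_lossy` on the same bytes keeps `Hello` and `World` and substitutes U+FFFD for each invalid byte. `replacement_count` is a convenient way to assert on that in a test.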

 **Implementation:**
 - Added `lossy` field to `SyslogConfig` (defaults to `false`)
 - Implemented lossy UTF-8 handling in `OctetCountingDecoder` framing layer
 - Propagated setting through TCP, UDP, and Unix socket transports
 - Added comprehensive tests for all three transport modes
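The framing-layer part of the change can be pictured with a small sketch. The `decode_frame` function below is hypothetical and is not Vector's `OctetCountingDecoder`; it assumes a complete octet-counted frame (`<length> <payload>`, in the style of RFC 6587) is already in the buffer, and it uses plain `String` errors for brevity.

```rust
/// Hypothetical helper, not Vector's `OctetCountingDecoder`.
/// Parses one octet-counted frame from the front of `buf` and returns the
/// decoded payload together with the bytes that follow the frame.
fn decode_frame(buf: &[u8], lossy: bool) -> Result<(String, &[u8]), String> {
    // The length prefix is ASCII digits terminated by a single space.
    let space = buf
        .iter()
        .position(|&b| b == b' ')
        .ok_or("missing length delimiter")?;
    let len = std::str::from_utf8(&buf[..space])
        .map_err(|e| e.to_string())?
        .parse::<usize>()
        .map_err(|e| e.to_string())?;

    // Slice out exactly `len` payload bytes, refusing truncated frames.
    let start = space + 1;
    let end = start.checked_add(len).ok_or("length overflow")?;
    let payload = buf.get(start..end).ok_or("incomplete frame")?;

    let text = if lossy {
        // Lossy: keep the frame, replacing invalid sequences with U+FFFD.
        String::from_utf8_lossy(payload).into_owned()
    } else {
        // Strict: one invalid byte fails the whole frame.
        String::from_utf8(payload.to_vec()).map_err(|e| e.to_string())?
    };
    Ok((text, &buf[end..]))
}
```

The only branch that differs between the two modes is the final decode step; the length parsing and bounds checks are the same either way, which is why a single boolean is enough to carry the setting through the framing layer in this sketch.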

 **Motivation:**
 Currently, messages with invalid UTF-8 are dropped with "Failed framing bytes. error=Unable to decode input as UTF8" errors. This is
 problematic for legacy systems, binary data, or corrupted transmissions. This PR provides an opt-in way to preserve these messages.

 ## Vector configuration

 ```toml
 # Test configuration used for validation

 [sources.syslog_strict]
   type = "syslog"
   mode = "tcp"
   address = "0.0.0.0:5514"
   lossy = false  # Default: drops invalid UTF-8

 [sources.syslog_lossy]
   type = "syslog"
   mode = "tcp"
   address = "0.0.0.0:5515"
   lossy = true   # New: replaces invalid UTF-8 with U+FFFD

 [sinks.console]
   type = "console"
   inputs = ["syslog_strict", "syslog_lossy"]
   encoding.codec = "json"
 ```

     ## How did you test this PR?

     ### Unit Tests
     Added three comprehensive tests:
     - **`test_tcp_syslog_lossy`**: TCP with octet-counting framing
     - **`test_udp_syslog_lossy`**: UDP with bytes framing
     - **`test_unix_stream_syslog_lossy`**: Unix socket with octet-counting

     All tests send syslog messages with invalid UTF-8 sequences (`\xE4\x80\xFF\xFE`) and verify that:
     - Messages are processed (not dropped) when `lossy = true`
     - `char::REPLACEMENT_CHARACTER` (U+FFFD) appears in the output

     All 28 syslog source tests pass:

     ```bash
     cargo test --package vector --lib sources::syslog --no-fail-fast
     # Result: 28 passed; 0 failed
     ```

     ### Integration Tests
     Manually tested with Vector running locally:

     **Test 1 - Strict mode (`lossy = false`):**
     - Sent message with invalid UTF-8 to port 5514
     - ✅ Message dropped
     - ✅ Error logged: "Failed framing bytes. error=Unable to decode message as UTF8"
     - ✅ Successfully replicates issue #20462

     **Test 2 - Lossy mode (`lossy = true`):**
     - Sent same invalid UTF-8 message to port 5515
     - ✅ Message processed successfully
     - ✅ JSON output: `{"message": "Test with non-UTF8: \ufffd\ufffd\ufffd", ...}`
     - ✅ No error logged

     **Test 3 - Valid UTF-8 (regression):**
     - Sent valid UTF-8 with Unicode characters to both ports
     - ✅ Both modes processed correctly
     - ✅ No regressions

     **Test 4 - Mixed valid/invalid UTF-8:**
     - Sent "Hello \xFF\xFE World" to lossy port
     - ✅ Output: "Hello \ufffd\ufffd World"
     - ✅ Valid parts preserved, invalid replaced

     ### Build & Check
     ```bash
     cargo check    # Passed
     cargo clippy   # No new warnings
     cargo build --release  # Successful
     ```

     ## Change Type
     - [ ] Bug fix
     - [x] New feature
     - [ ] Non-functional (chore, refactoring, docs)
     - [ ] Performance

     ## Is this a breaking change?
     - [ ] Yes
     - [x] No

     ## Does this PR include user facing changes?

     - [x] Yes. Please add a changelog fragment based on our
     [guidelines](https://github.com/vectordotdev/vector/blob/master/changelog.d/README.md).
     - [ ] No. A maintainer will apply the `no-changelog` label to this PR.

     **Changelog fragment needed:** `enhancement` for adding new `lossy` configuration option to syslog source.

     ## References

     - Closes: #20462
     - Related patterns in existing codecs:
       - JSON codec: `lib/codecs/src/decoding/format/json.rs`
       - GELF codec: `lib/codecs/src/decoding/format/gelf.rs`
       - InfluxDB codec: `lib/codecs/src/decoding/format/influxdb.rs`

     ## Design Decisions

     ### Default Value: `false`
     While other Vector codecs default `lossy` to `true`, this implementation defaults to `false`:
     - **Explicit opt-in**: Makes behavior change intentional
     - **Stricter validation**: Maintains data integrity by default
     - **Visibility**: Makes encoding issues visible rather than silently transforming data

     Users who need lossy mode can explicitly enable it in their configuration.

     ### Transport Coverage
     The implementation handles lossy UTF-8 at two layers:
     - **Framing layer** (octet-counting): TCP and Unix sockets
     - **Deserialization layer**: All modes (TCP, UDP, Unix)

     This ensures consistent behavior across all transport types.

     ## Notes
     - [x] Ran `make fmt` - No formatting issues
     - [x] Ran `make check-clippy` - No new clippy warnings
     - [x] Ran `make test` - All tests pass (28 syslog tests)
     - [x] Up-to-date with latest master
     - [x] Comprehensive documentation added with structured sections
     - [ ] No changes to `Cargo.lock` dependencies (no license update needed)

Add lossy configuration option to handle messages with invalid UTF-8
sequences. When enabled, invalid bytes are replaced with U+FFFD instead
of dropping the entire message.

Includes comprehensive tests for TCP, UDP, and Unix socket transports.

Fixes vectordotdev#20462
@mezgerj mezgerj requested a review from a team as a code owner February 15, 2026 05:14
@github-actions github-actions bot added the `domain: sources` label Feb 15, 2026

github-actions bot commented Feb 15, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

Author

mezgerj commented Feb 15, 2026

recheck

Author

mezgerj commented Feb 17, 2026

recheck

Author

mezgerj commented Feb 17, 2026

@jszwedko Any idea what I can do to get the CLA job to pass? I have signed the CLA

Collaborator

jszwedko commented Feb 17, 2026

> @jszwedko Any idea what I can do to get the CLA job to pass? I have signed the CLA

Ah, did you leave a comment like the bot asks? I'm not seeing it. Note the CLA bot did change somewhat recently to require a PR comment to "sign the CLA" vs. the previous implementation that had a form (in case you had seen that one before).

Author

mezgerj commented Feb 18, 2026

I have read the CLA Document and I hereby sign the CLA
