[fix][test] Fix flaky OneWayReplicatorDeduplicationTest.testDeduplication#25679
Conversation
The title of the PR should reflect that this is a production code change, not a fix to a flaky test. |
lhotari
left a comment
There was a problem hiding this comment.
I don't think that MessageDuplication needs to be changed. The test itself is invalid and should be fixed instead. Here's the explanation of MessageDuplication, answer from DeepWiki:
MessageDeduplication.takeSnapshot(Position)— Role and CorrectnessRole of the method
The private
takeSnapshot(Position)inorg.apache.pulsar.broker.service.persistent.MessageDeduplicationpersists the current deduplication state to aManagedCursor. The snapshot captures the highest sequence IDs persisted per producer (thehighestSequencedPersistedmap) and marks the cursor at the suppliedPosition, which corresponds to the last confirmed entry in theManagedLedger. This is what allows deduplication state to be recovered correctly after a broker restart or topic unload/reload.When it's invoked
- Periodically, after
snapshotIntervalentries have been persisted (counter-driven, fromrecordMessagePersistedNormal/recordMessagePersistedRepl).- After replaying the cursor on a deduplication status check, if enough entries were processed.
- After purging inactive producers, to persist the trimmed state.
- From a scheduled task driven by
brokerDeduplicationSnapshotFrequencyInSeconds/brokerDeduplicationSnapshotIntervalSeconds, via the publictakeSnapshot()wrapper.- From
BacklogQuotaManagerwhen the deduplication cursor is the slowest consumer and needs to be advanced.Correctness requirements when one is already in progress
Concurrency is gated by an
AtomicBoolean snapshotTaking. If a snapshot is requested while another is in flight, the new request is dropped (not queued) and a warning is logged — there is no waiting and no replacement. The contract relies on two things to stay correct under that drop:
- The snapshot reads
highestSequencedPersistedat invocation time and pairs it with the suppliedPosition(last confirmed entry), so each snapshot is internally consistent.- The underlying
ManagedCursor.markDeleteis monotonic — it never moves the cursor backward — so even though concurrent requests are skipped, missing one is safe: a later snapshot will advance the cursor to a position at least as far as the dropped one would have. Combined with the periodic/scheduled triggers, the cursor is guaranteed to keep moving forward.The practical implication: callers must not assume their requested snapshot actually ran. The mechanism is designed to be lossy-but-monotonic — fine for periodic persistence, but not something to rely on for "snapshot exactly at this position right now" semantics.
Source: DeepWiki query on apache/pulsar
One possible way to address the flaky test problem is to configure brokerDeduplicationSnapshotIntervalSeconds to a low value in the test instead of relying on brokerDeduplicationEntriesInterval. This shouldn't change the intention of the test at all.
|
Thanks for the detailed explanation, @lhotari. My original change aimed to resolve the I agree that fixing the test configuration is the better approach here, I will: Revert the changes to MessageDeduplication.java. Update the test to use a lower brokerDeduplicationSnapshotIntervalSeconds as suggested to ensure frequent snapshots during verification. Update the PR title/description to reflect a test stabilization change. Pushing the update shortly. |
2cc1fee to
57403b2
Compare
|
Thanks for the contribution @Praveenkumar76 |
|
/pulsarbot rerun |
Fixes apache#25141
Motivation
OneWayReplicatorDeduplicationTest.testDeduplicationis flaky becauseMessageDeduplication.takeSnapshotis designed to drop concurrent snapshot requests if an existing snapshot operation is already in progress. The test relies on specific positions being persisted, but if a manual trigger is dropped due to this intentional "lossy" design, the test fails with aConditionTimeoutExceptionwhile waiting for the state to update.Modifications
Modified
OneWayReplicatorDeduplicationTestto setbrokerDeduplicationSnapshotIntervalSecondsto 1 second in the test configuration.By increasing the frequency of the background snapshot monitor, the system is guaranteed to capture and persist the deduplication state frequently enough to satisfy the test's assertions, even if a specific manual trigger request is skipped due to a race condition.
Verifying this change
This change is already covered by existing tests, such as:
org.apache.pulsar.broker.service.OneWayReplicatorDeduplicationTest.testDeduplicationDoes this pull request potentially affect one of the following parts: