
[fix][test] Fix flaky OneWayReplicatorDeduplicationTest.testDeduplication#25679

Merged
lhotari merged 1 commit into apache:master from cognitree:fix/flaky-deduplication-test
May 6, 2026

Conversation

@Praveenkumar76
Contributor

@Praveenkumar76 Praveenkumar76 commented May 5, 2026

Fixes apache#25141

Motivation

OneWayReplicatorDeduplicationTest.testDeduplication is flaky because MessageDeduplication.takeSnapshot is designed to drop concurrent snapshot requests if an existing snapshot operation is already in progress. The test relies on specific positions being persisted, but if a manual trigger is dropped due to this intentional "lossy" design, the test fails with a ConditionTimeoutException while waiting for the state to update.

Modifications

  • Modified OneWayReplicatorDeduplicationTest to set brokerDeduplicationSnapshotIntervalSeconds to 1 second in the test configuration.

  • By increasing the frequency of the background snapshot monitor, the system is guaranteed to capture and persist the deduplication state frequently enough to satisfy the test's assertions, even if a specific manual trigger request is skipped due to a race condition.
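The fix amounts to a single configuration override in the test setup. In broker.conf terms, the equivalent setting would be (the comment paraphrases the rationale above):

```
# Take a deduplication snapshot every second so the background snapshot
# monitor persists state frequently enough for the test's assertions,
# even when a concurrent manual trigger is dropped.
brokerDeduplicationSnapshotIntervalSeconds=1
```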

Verifying this change

  • Verified that the flaky test now runs consistently without failures.
  • Confirmed that the 1-second background interval provides a sufficient "safety net" to persist the state when concurrent manual triggers are dropped.
  • Made sure that the change passes the CI checks.

This change is already covered by existing tests, such as:

  • org.apache.pulsar.broker.service.OneWayReplicatorDeduplicationTest.testDeduplication

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

@lhotari
Member

lhotari commented May 5, 2026

The root cause is a race condition in MessageDeduplication.takeSnapshot. When multiple snapshot requests occur concurrently, the current implementation uses a compareAndSet guard to allow only one active snapshot. If another request arrives while a snapshot is already in progress, the method immediately returns a completed future, effectively dropping the new request.
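The guard pattern described above can be sketched as follows. This is a simplified, self-contained illustration of the lossy compareAndSet gate, not the actual Pulsar source; the reentrant call in main stands in for a concurrent request arriving while a snapshot is in flight.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SnapshotGuardDemo {
    private final AtomicBoolean snapshotTaking = new AtomicBoolean(false);

    /** Returns true if this call actually ran the snapshot, false if it was dropped. */
    boolean takeSnapshot(Runnable persist) {
        if (!snapshotTaking.compareAndSet(false, true)) {
            // Another snapshot is already in flight: drop this request, don't queue it.
            return false;
        }
        try {
            persist.run();
            return true;
        } finally {
            snapshotTaking.set(false);
        }
    }

    public static void main(String[] args) {
        SnapshotGuardDemo guard = new SnapshotGuardDemo();
        boolean[] inner = new boolean[1];
        // The inner request arrives while the outer snapshot is "in progress",
        // so it is silently dropped -- the behavior the test tripped over.
        boolean outer = guard.takeSnapshot(() ->
                inner[0] = guard.takeSnapshot(() -> { }));
        System.out.println("outer ran: " + outer);
        System.out.println("inner ran: " + inner[0]);
    }
}
```

A test that asserts on state immediately after a manual trigger can therefore fail whenever its request happens to overlap with an in-flight snapshot.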

The title of the PR should reflect that this is a production code change, not a fix to a flaky test.

Member

@lhotari lhotari left a comment


I don't think that MessageDeduplication needs to be changed. The test itself is invalid and should be fixed instead. Here's an explanation of MessageDeduplication (answer from DeepWiki):

MessageDeduplication.takeSnapshot(Position) — Role and Correctness

Role of the method

The private takeSnapshot(Position) in org.apache.pulsar.broker.service.persistent.MessageDeduplication persists the current deduplication state to a ManagedCursor. The snapshot captures the highest sequence IDs persisted per producer (the highestSequencedPersisted map) and marks the cursor at the supplied Position, which corresponds to the last confirmed entry in the ManagedLedger. This is what allows deduplication state to be recovered correctly after a broker restart or topic unload/reload.

When it's invoked

  • Periodically, after snapshotInterval entries have been persisted (counter-driven, from recordMessagePersistedNormal / recordMessagePersistedRepl).
  • After replaying the cursor on a deduplication status check, if enough entries were processed.
  • After purging inactive producers, to persist the trimmed state.
  • From a scheduled task driven by brokerDeduplicationSnapshotFrequencyInSeconds / brokerDeduplicationSnapshotIntervalSeconds, via the public takeSnapshot() wrapper.
  • From BacklogQuotaManager when the deduplication cursor is the slowest consumer and needs to be advanced.

Correctness requirements when one is already in progress

Concurrency is gated by an AtomicBoolean snapshotTaking. If a snapshot is requested while another is in flight, the new request is dropped (not queued) and a warning is logged — there is no waiting and no replacement. The contract relies on two things to stay correct under that drop:

  1. The snapshot reads highestSequencedPersisted at invocation time and pairs it with the supplied Position (last confirmed entry), so each snapshot is internally consistent.
  2. The underlying ManagedCursor.markDelete is monotonic — it never moves the cursor backward — so even though concurrent requests are skipped, missing one is safe: a later snapshot will advance the cursor to a position at least as far as the dropped one would have. Combined with the periodic/scheduled triggers, the cursor is guaranteed to keep moving forward.

The practical implication: callers must not assume their requested snapshot actually ran. The mechanism is designed to be lossy-but-monotonic — fine for periodic persistence, but not something to rely on for "snapshot exactly at this position right now" semantics.
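Given that contract, the safe way for a caller (or test) to wait on a lossy-but-monotonic mechanism is to poll until the persisted position reaches the target, rather than asserting right after a trigger. A minimal sketch of that pattern, with illustrative names only (not Pulsar APIs; the background thread stands in for the periodic snapshot task):

```java
import java.util.concurrent.atomic.AtomicLong;

public class MonotonicPollDemo {
    // Stand-in for the persisted snapshot position; markDelete-style
    // updates are monotonic and never move the position backward.
    static final AtomicLong persistedPosition = new AtomicLong(0);

    static void advanceTo(long target) {
        persistedPosition.accumulateAndGet(target, Math::max); // monotonic
    }

    /** Polls until the persisted position reaches target, like Awaitility's untilAsserted. */
    static boolean awaitAtLeast(long target, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (persistedPosition.get() >= target) {
                return true;
            }
            Thread.sleep(10);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a dropped manual trigger: a later periodic snapshot still
        // advances the position at least as far as the dropped one would have.
        Thread periodic = new Thread(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            advanceTo(42);
        });
        periodic.start();
        System.out.println("reached 42: " + awaitAtLeast(42, 2000));
        periodic.join();
    }
}
```

With a 1-second snapshot interval, the test's Awaitility-style polling is guaranteed to observe the state within its timeout even if its own trigger was dropped.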


Source: DeepWiki query on apache/pulsar

One possible way to address the flaky test problem is to configure brokerDeduplicationSnapshotIntervalSeconds to a low value in the test instead of relying on brokerDeduplicationEntriesInterval. This shouldn't change the intention of the test at all.

@Praveenkumar76
Contributor Author

Praveenkumar76 commented May 6, 2026

Thanks for the detailed explanation, @lhotari.

My original change aimed to resolve the ConditionTimeoutException by ensuring the test never missed a snapshot signal, but I now understand that the 'lossy-but-monotonic' design is intentional for production performance.

I agree that fixing the test configuration is the better approach here. I will:

  • Revert the changes to MessageDeduplication.java.

  • Update the test to use a lower brokerDeduplicationSnapshotIntervalSeconds as suggested, to ensure frequent snapshots during verification.

  • Update the PR title/description to reflect a test stabilization change.

Pushing the update shortly.

@Praveenkumar76 Praveenkumar76 changed the title [fix][broker] Fix flaky OneWayReplicatorDeduplicationTest by coalescing snapshot requests [fix][test] Fix flaky OneWayReplicatorDeduplicationTest.testDeduplication May 6, 2026
@Praveenkumar76 Praveenkumar76 force-pushed the fix/flaky-deduplication-test branch from 2cc1fee to 57403b2 on May 6, 2026 07:08
@Praveenkumar76 Praveenkumar76 requested a review from lhotari May 6, 2026 07:10
Member

@lhotari lhotari left a comment


LGTM

@lhotari
Member

lhotari commented May 6, 2026

Thanks for the contribution @Praveenkumar76

@Praveenkumar76
Contributor Author

/pulsarbot rerun

@lhotari lhotari merged commit 9e40b55 into apache:master May 6, 2026
79 of 81 checks passed
poorbarcode pushed a commit to poorbarcode/pulsar that referenced this pull request May 6, 2026
@Praveenkumar76 Praveenkumar76 deleted the fix/flaky-deduplication-test branch May 6, 2026 15:25

Development

Successfully merging this pull request may close these issues.

Flaky-test: OneWayReplicatorDeduplicationTest.testDeduplication
