fix: deflake //rs/tests/message_routing:global_reboot_test #8979
Merged
basvandijk merged 5 commits into master on Feb 23, 2026
Replace the fixed 15-second sleep before collecting pre-reboot metrics with a polling loop (using `retry_with_msg_async!`) that waits until all xnet canisters have received at least one response (up to 120 seconds).

Root cause: The test waits a fixed 15 seconds for cross-subnet (xnet) messages to complete a full round trip. On busy CI machines, xnet message routing latency can exceed this, resulting in zero responses being recorded for some canisters, causing the assertion `responses_pre_reboot > 0` to fail.

The fix sleeps `MSG_EXEC_TIME_SEC` (15s) initially, then polls metrics every 5 seconds using the standard `retry_with_msg_async!` macro until all canisters show at least one response, with a total timeout of 120s.

Also improves assertion messages to include subnet/canister indices and metrics values for easier debugging.

Skill: .claude/skills/fix-flaky-tests/SKILL.md
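The poll-until-timeout idea in this commit message can be sketched as follows. This is a minimal synchronous stand-in that uses only the standard library; it is not the `retry_with_msg_async!` macro the test uses, and `poll_until` and `all_have_responses` are hypothetical names introduced for illustration. The per-canister response counts are assumed to be available as a slice of `u64`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `check` repeatedly, sleeping `interval`
/// between attempts, until it returns `Ok` or `timeout` has elapsed.
/// On timeout, the last error message is included in the returned error.
fn poll_until<T, F>(timeout: Duration, interval: Duration, mut check: F) -> Result<T, String>
where
    F: FnMut() -> Result<T, String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match check() {
            Ok(value) => return Ok(value),
            Err(msg) => {
                if Instant::now() >= deadline {
                    return Err(format!("timed out after {:?}: {}", timeout, msg));
                }
                sleep(interval);
            }
        }
    }
}

/// Hypothetical check: succeeds once every canister has recorded at least
/// one response; otherwise reports the first canister that is still at zero.
fn all_have_responses(responses: &[u64]) -> Result<(), String> {
    match responses.iter().position(|&r| r == 0) {
        None => Ok(()),
        Some(idx) => Err(format!("canister {} has no responses yet", idx)),
    }
}
```

With the values from the commit message, a caller would pass a 120-second timeout and a 5-second interval, and the closure would fetch fresh metrics on every attempt before calling the check. Returning the last error message on timeout keeps the failure output informative, in the same spirit as the improved assertion messages.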
…eived, before and after reboot. Actually check that no call errors occurred despite the replicas rebooting.
michael-weigelt approved these changes on Feb 23, 2026
`//rs/tests/message_routing:global_reboot_test` was almost 5% flaky over the last week. All flaky runs failed in the same way.
Claude Opus 4.6 produced the following root cause analysis and accompanying fix (manually improved further):
Root Cause
The test waits a fixed 15 seconds (`MSG_EXEC_TIME_SEC`) for cross-subnet (xnet) messages to complete a full round trip before collecting pre-reboot metrics. On busy CI machines, xnet message routing latency can exceed this fixed wait, resulting in zero responses being recorded for some canisters. This causes the assertion `responses_pre_reboot > 0` to fail.

All 6 flaky runs in the past week failed with the same error: `assertion failed: responses_pre_reboot > 0`.

Fix
The test now waits until every canister has received at least 100 responses, then reboots, then waits until every canister has received at least 100 more responses on top of what was observed before the reboot. (There is a race condition here, but 100 responses / 10 rounds should pretty much ensure that at least some of the latter 100 are responses to requests sent after the reboot, i.e. that we got a few round trips after the replica reboot.)
Added helper functions `responses_count()` and `all_canisters_made_progress()` to reduce code duplication.

Finally, as claimed but not actually implemented originally, the test checks that no calls failed, were rejected, or arrived out of order.
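The progress condition described in this section (every canister gains at least 100 responses on top of a baseline) can be illustrated with a small sketch. The signatures of the PR's actual helpers are not reproduced here; `lagging_canisters` and `made_progress` are hypothetical names, and the snapshots are assumed to be plain slices of per-canister response counts.

```rust
/// Hypothetical helper: returns the indices of canisters that have not yet
/// received at least `min_new` responses beyond their `baseline` count.
fn lagging_canisters(baseline: &[u64], current: &[u64], min_new: u64) -> Vec<usize> {
    assert_eq!(
        baseline.len(),
        current.len(),
        "snapshots must cover the same canisters"
    );
    let mut lagging = Vec::new();
    for (idx, (before, now)) in baseline.iter().zip(current.iter()).enumerate() {
        // saturating_sub avoids underflow if a count is ever below its baseline.
        if now.saturating_sub(*before) < min_new {
            lagging.push(idx);
        }
    }
    lagging
}

/// Hypothetical helper: true once every canister has gained at least
/// `min_new` responses relative to `baseline`.
fn made_progress(baseline: &[u64], current: &[u64], min_new: u64) -> bool {
    lagging_canisters(baseline, current, min_new).is_empty()
}
```

Under these assumptions, the pre-reboot wait corresponds to an all-zero baseline with `min_new = 100`, and the post-reboot wait uses the counts observed just before the reboot as the baseline. Returning the lagging indices rather than only a boolean makes it easy to name the affected canisters in an assertion message.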
Skill
.claude/skills/fix-flaky-tests/SKILL.md