Skip to content

Comments

fix: deflake //rs/tests/message_routing:global_reboot_test#8979

Merged
basvandijk merged 5 commits intomasterfrom
ai/deflake-global_reboot_test
Feb 23, 2026
Merged

fix: deflake //rs/tests/message_routing:global_reboot_test#8979
basvandijk merged 5 commits intomasterfrom
ai/deflake-global_reboot_test

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 21, 2026

The //rs/tests/message_routing:global_reboot_test is almost 5% flaky in the last week:

$ bazel run //ci/githubstats:query -- top 1 flaky --include //rs/tests/message_routing:global_reboot_test --week
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┑
│    │ label                                         │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners   │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┥
│  0 │ //rs/tests/message_routing:global_reboot_test │     126 │             6 │       6 │         0 │      0 │            4.8 │      4.8 │          0 │       0 │    36:00 │         12:36:00 │           6:00 │ team-dsm │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┙

All flaky runs failed in the same way:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/message_routing:global_reboot_test --columns=last_started_at,errors
...
Downloading logs to: /ic/logs/global_reboot_test/2026-02-21T13:29:05
...
╒════╤═════════════════════════╤═════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                  │
╞════╪═════════════════════════╪═════════════════════════════════════════════════════╡
│  0 │ Sat 2026-02-21 00:08:07 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  1 │ Fri 2026-02-20 11:17:20 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  2 │ Fri 2026-02-20 05:13:10 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  3 │ Wed 2026-02-18 12:44:14 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  4 │ Mon 2026-02-16 15:43:52 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  5 │ Sun 2026-02-15 12:09:56 │ 1: test: assertion failed: responses_pre_reboot > 0 │
╘════╧═════════════════════════╧═════════════════════════════════════════════════════╛

Claude Opus 4.6 concluded the following Root Cause Analysis and accompanying fix (manually improved further):

Root Cause

The test waits a fixed 15 seconds (MSG_EXEC_TIME_SEC) for cross-subnet (xnet) messages to complete a full round trip before collecting pre-reboot metrics. On busy CI machines, xnet message routing latency can exceed this fixed wait, resulting in zero responses being recorded for some canisters. This causes the assertion responses_pre_reboot > 0 to fail.

All 6 flaky runs in the past week failed with the same error: assertion failed: responses_pre_reboot > 0.

Fix

The test now waits until every canister receives at least 100 responses. Then reboots. Then waits until every canister receives at least 100 responses on top of what was observed before the reboot. (There's a race condition in there, but 100 responses / 10 rounds should pretty much ensure that at least some of these latter 100 are responses to requests sent after the reboot. I.e. that we got a few roundtrips after the replica reboot.)

Added helper functions responses_count() and all_canisters_made_progress() to reduce code duplication.

Finally, as claimed but not actually implemented originally, it checks that no calls failed, were rejected or arrived out of order.

Skill

.claude/skills/fix-flaky-tests/SKILL.md

Replace fixed 15-second sleep before collecting pre-reboot metrics with
a polling loop (using retry_with_msg_async!) that waits until all xnet
canisters have received at least one response (up to 120 seconds).

Root cause: The test waits a fixed 15 seconds for cross-subnet (xnet)
messages to complete a full round trip. On busy CI machines, xnet
message routing latency can exceed this, resulting in zero responses
being recorded for some canisters, causing the assertion
`responses_pre_reboot > 0` to fail.

The fix sleeps MSG_EXEC_TIME_SEC (15s) initially, then polls metrics
every 5 seconds using the standard retry_with_msg_async! macro until
all canisters show at least one response, with a total timeout of 120s.

Also improves assertion messages to include subnet/canister indices and
metrics values for easier debugging.

Skill: .claude/skills/fix-flaky-tests/SKILL.md
@github-actions github-actions bot added the fix label Feb 21, 2026
@basvandijk basvandijk marked this pull request as ready for review February 21, 2026 13:31
@basvandijk basvandijk requested a review from a team as a code owner February 21, 2026 13:31
basvandijk and others added 4 commits February 23, 2026 08:43
@basvandijk basvandijk enabled auto-merge February 23, 2026 14:34
@basvandijk basvandijk added this pull request to the merge queue Feb 23, 2026
Merged via the queue into master with commit 6498c7b Feb 23, 2026
41 checks passed
@basvandijk basvandijk deleted the ai/deflake-global_reboot_test branch February 23, 2026 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants