fix: deflake //rs/tests/message_routing:global_reboot_test by basvandijk · Pull Request #8979 · dfinity/ic

basvandijk · 2026-02-21T13:18:59Z

The //rs/tests/message_routing:global_reboot_test was flaky in almost 5% of runs (6 of 126) over the last week:

$ bazel run //ci/githubstats:query -- top 1 flaky --include //rs/tests/message_routing:global_reboot_test --week
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┑
│    │ label                                         │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners   │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┥
│  0 │ //rs/tests/message_routing:global_reboot_test │     126 │             6 │       6 │         0 │      0 │            4.8 │      4.8 │          0 │       0 │    36:00 │         12:36:00 │           6:00 │ team-dsm │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┙

All flaky runs failed in the same way:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/message_routing:global_reboot_test --columns=last_started_at,errors
...
Downloading logs to: /ic/logs/global_reboot_test/2026-02-21T13:29:05
...
╒════╤═════════════════════════╤═════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                  │
╞════╪═════════════════════════╪═════════════════════════════════════════════════════╡
│  0 │ Sat 2026-02-21 00:08:07 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  1 │ Fri 2026-02-20 11:17:20 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  2 │ Fri 2026-02-20 05:13:10 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  3 │ Wed 2026-02-18 12:44:14 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  4 │ Mon 2026-02-16 15:43:52 │ 1: test: assertion failed: responses_pre_reboot > 0 │
├────┼─────────────────────────┼─────────────────────────────────────────────────────┤
│  5 │ Sun 2026-02-15 12:09:56 │ 1: test: assertion failed: responses_pre_reboot > 0 │
╘════╧═════════════════════════╧═════════════════════════════════════════════════════╛

Claude Opus 4.6 produced the following root cause analysis and accompanying fix (which was then manually improved further):

Root Cause

The test waits a fixed 15 seconds (MSG_EXEC_TIME_SEC) for cross-subnet (xnet) messages to complete a full round trip before collecting pre-reboot metrics. On busy CI machines, xnet message routing latency can exceed this fixed wait, resulting in zero responses being recorded for some canisters. This causes the assertion responses_pre_reboot > 0 to fail.

All 6 flaky runs in the past week failed with the same error: assertion failed: responses_pre_reboot > 0.

Fix

The test now waits until every canister has received at least 100 responses, then reboots, then waits until every canister has received at least 100 more responses than were observed before the reboot. (There is a race condition here, but 100 responses / 10 rounds should pretty much ensure that at least some of the latter 100 are responses to requests sent after the reboot, i.e. that we got a few round trips after the replica reboot.)
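To illustrate the difference between a fixed wait and a condition-based wait, here is a minimal std-only sketch. The `poll_until` helper and the simulated response counter are hypothetical and are not the code from this PR; the test itself uses the `retry_with_msg_async!` macro, whose exact signature is not shown here.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper (an assumption, not the retry_with_msg_async! macro):
/// calls `check` every `interval` until it yields `Some(value)` or until
/// `timeout` has elapsed.
fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Result<T, String> {
    let start = Instant::now();
    loop {
        if let Some(value) = check() {
            return Ok(value);
        }
        if start.elapsed() >= timeout {
            return Err(format!("condition not met within {:?}", timeout));
        }
        sleep(interval);
    }
}

fn main() {
    // Stand-in for fetching metrics: the simulated count grows by 40 per poll.
    let mut responses = 0u64;
    let result = poll_until(Duration::from_secs(1), Duration::from_millis(10), || {
        responses += 40;
        // Succeed only once the threshold of 100 responses is reached.
        (responses >= 100).then_some(responses)
    });
    println!("{:?}", result);
}
```

A slow machine only makes such a loop take longer; it fails only if the condition is still unmet when the timeout expires.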

Added helper functions responses_count() and all_canisters_made_progress() to reduce code duplication.
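As a rough sketch of what these two helpers might look like: the function names come from this PR, but the signatures, the `Metrics` layout, and the numbers below are assumptions made for illustration only.

```rust
/// Hypothetical stand-in for the per-canister metrics the test collects.
#[derive(Clone, Debug, Default)]
struct Metrics {
    /// Number of responses observed in each latency bucket (assumed layout).
    latency_bucket_counts: Vec<u64>,
}

/// Total number of responses a canister has received so far.
fn responses_count(metrics: &Metrics) -> u64 {
    metrics.latency_bucket_counts.iter().sum()
}

/// True if every canister has received at least `min_additional` responses
/// on top of its entry in `baseline`.
fn all_canisters_made_progress(
    baseline: &[u64],
    current: &[Metrics],
    min_additional: u64,
) -> bool {
    baseline.len() == current.len()
        && baseline
            .iter()
            .zip(current)
            .all(|(before, now)| responses_count(now) >= before + min_additional)
}

fn main() {
    // Before the reboot the baseline is zero for every canister.
    let pre_reboot = vec![
        Metrics { latency_bucket_counts: vec![60, 50] },  // 110 responses
        Metrics { latency_bucket_counts: vec![100, 20] }, // 120 responses
    ];
    assert!(all_canisters_made_progress(&[0, 0], &pre_reboot, 100));

    // After the reboot the pre-reboot counts become the baseline.
    let baseline: Vec<u64> = pre_reboot.iter().map(responses_count).collect();
    let post_reboot = vec![
        Metrics { latency_bucket_counts: vec![150, 70] }, // 220: +110
        Metrics { latency_bucket_counts: vec![130, 60] }, // 190: only +70
    ];
    // The second canister has not yet gained 100 responses, so keep waiting.
    assert!(!all_canisters_made_progress(&baseline, &post_reboot, 100));
}
```

Using the same predicate with a zero baseline before the reboot and the recorded counts afterwards is one way to share the check between both phases.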

Finally, the test now checks that no calls failed, were rejected, or arrived out of order. The original test claimed to check this but never did.
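A check of this kind could look roughly like the following. The `CallErrors` struct, its field names, and `check_no_call_errors` are hypothetical, chosen to mirror the three failure modes named above rather than copied from the test.

```rust
/// Hypothetical error counters for one canister (field names are assumptions).
#[derive(Debug, Default)]
struct CallErrors {
    /// Calls that failed outright.
    call_errors: u64,
    /// Calls that came back as rejects.
    reject_responses: u64,
    /// Messages that arrived out of order.
    seq_errors: u64,
}

/// Returns an error naming the subnet and canister if any counter is non-zero.
fn check_no_call_errors(
    subnet_idx: usize,
    canister_idx: usize,
    errors: &CallErrors,
) -> Result<(), String> {
    if errors.call_errors == 0 && errors.reject_responses == 0 && errors.seq_errors == 0 {
        Ok(())
    } else {
        Err(format!(
            "subnet {}, canister {}: unexpected call errors: {:?}",
            subnet_idx, canister_idx, errors
        ))
    }
}

fn main() {
    // A clean run passes.
    assert!(check_no_call_errors(0, 0, &CallErrors::default()).is_ok());

    // A single out-of-order message is reported with its location.
    let bad = CallErrors { seq_errors: 1, ..Default::default() };
    let err = check_no_call_errors(1, 2, &bad).unwrap_err();
    println!("{}", err);
}
```

Including the indices and counter values in the message makes a failure easier to trace back to a specific canister.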

Skill

.claude/skills/fix-flaky-tests/SKILL.md

Replace fixed 15-second sleep before collecting pre-reboot metrics with a polling loop (using retry_with_msg_async!) that waits until all xnet canisters have received at least one response (up to 120 seconds).

Root cause: The test waits a fixed 15 seconds for cross-subnet (xnet) messages to complete a full round trip. On busy CI machines, xnet message routing latency can exceed this, resulting in zero responses being recorded for some canisters, causing the assertion `responses_pre_reboot > 0` to fail.

The fix sleeps MSG_EXEC_TIME_SEC (15s) initially, then polls metrics every 5 seconds using the standard retry_with_msg_async! macro until all canisters show at least one response, with a total timeout of 120s. Also improves assertion messages to include subnet/canister indices and metrics values for easier debugging.

Skill: .claude/skills/fix-flaky-tests/SKILL.md

rs/tests/message_routing/global_reboot_test.rs

github-actions bot added the fix label Feb 21, 2026

basvandijk marked this pull request as ready for review February 21, 2026 13:31

basvandijk requested a review from a team as a code owner February 21, 2026 13:31

github-actions bot added the @team-dsm label Feb 21, 2026

alin-at-dfinity reviewed Feb 23, 2026

basvandijk and others added 4 commits February 23, 2026 08:43

Remove the initial 15-second sleep

7c89bf6

abstract collect_metrics_with_responses in a function and use it in step 7 as well

6018ed3

Wait and retry until a minimum (additional) number of requests is received, before and after reboot. Actually check that no call errors occurred despite the replicas rebooting.

47f2d25

Make clippy happy.

5423014

michael-weigelt approved these changes Feb 23, 2026

basvandijk enabled auto-merge February 23, 2026 14:34

basvandijk added this pull request to the merge queue Feb 23, 2026

Merged via the queue into master with commit 6498c7b Feb 23, 2026
41 checks passed

basvandijk deleted the ai/deflake-global_reboot_test branch February 23, 2026 15:19

basvandijk merged 5 commits into master from ai/deflake-global_reboot_test

3 participants
