Skip to content

Comments

fix: deflake //rs/tests/consensus:cup_explorer_test#8992

Open
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-cup_explorer_test
Open

fix: deflake //rs/tests/consensus:cup_explorer_test#8992
basvandijk wants to merge 1 commit intomasterfrom
ai/deflake-cup_explorer_test

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Feb 22, 2026

The //rs/tests/consensus:cup_explorer_test is slightly flaky:

$ bazel run //ci/githubstats:query -- top 1 flaky% --week --include //rs/tests/consensus:cup_explorer_test
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━┯━━━━━━━━━━━┑
│    │ label                                  │   total │   non_success │   flaky │   timeout │   fail │   non_success% │   flaky% │   timeout% │   fail% │   impact │   total duration │   duration_p90 │ owners    │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━━┿━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━┿━━━━━━━━━━━┥
│  0 │ //rs/tests/consensus:cup_explorer_test │     121 │             3 │       3 │         0 │      0 │            2.5 │      2.5 │          0 │       0 │    13:33 │          9:06:31 │           4:31 │ consensus │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━┷━━━━━━━━━━━┙

Failing in mostly similar ways:

$ bazel run //ci/githubstats:query -- last --flaky --week //rs/tests/consensus:cup_explorer_test --columns=last_started_at,errors
...
╒════╤═════════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                           │
╞════╪═════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Fri 2026-02-20 15:30:44 │ 1: test: called `Result::unwrap()` on an `Err` value: "Registry CUPs cannot be verified w... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Thu 2026-02-19 18:09:33 │ 1: test: called `Result::unwrap()` on an `Err` value: "Registry CUPs cannot be verified w... │
├────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Thu 2026-02-19 14:49:28 │ 1: test: called `Result::unwrap()` on an `Err` value: "HttpClient: Request timed out for ... │
╘════╧═════════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════╛

Claude Opus 4.6 made the following Root Cause Analysis and accompanying fixes:

Root Cause

The cup_explorer_test flakes because the initial CUP downloaded from an application subnet may be a genesis CUP with NiDkgTargetSubnet::Remote. The verify() function calls get_subnet_id() which returns an error for Remote target subnets, and the unwrap() causes a panic.

This happens when the test runs before the app subnet has produced enough blocks to create a CUP with Local DKG target (typically needs 2+ DKG intervals from genesis). In failing runs, the CUP was at height 15 (first DKG interval), while passing runs had CUPs at height 75+.

A secondary flaky failure mode is HTTP request timeouts when querying the NNS node during verification.

Fix

  1. Change verify() in ic_cup_explorer to return Result<SubnetStatus, String> instead of panicking on errors (Remote DKG target, integrity check failures, signature verification failures, registry queries). This makes it possible to retry on transient failures.

  2. Wrap the initial `explore() + verify()` calls in the test with `retry_with_msg!` to wait until a CUP with Local DKG target is available, consistent with how the halted-state check is already retried.


Generated following .claude/skills/fix-flaky-tests/SKILL.md.

The cup_explorer_test flakes because the initial CUP downloaded from an
application subnet may be a genesis CUP with NiDkgTargetSubnet::Remote.
The verify() function calls get_subnet_id() which returns an error for
Remote target subnets, and the unwrap() causes a panic.

This happens when the test runs before the app subnet has produced enough
blocks to create a CUP with Local DKG target (typically needs 2+ DKG
intervals from genesis). In failing runs, the CUP was at height 15
(first DKG interval), while passing runs had CUPs at height 75+.

The fix has two parts:

1. Change verify() in ic_cup_explorer to return Result<SubnetStatus, String>
   instead of panicking on errors, making it possible to retry on transient
   failures including Remote DKG target CUPs and HTTP timeouts.

2. Wrap the initial explore() + verify() calls in the test with
   retry_with_msg! to wait until a CUP with Local DKG target is available,
   consistent with how the halted-state check is already retried.
@github-actions github-actions bot added the fix label Feb 22, 2026
@basvandijk
Copy link
Collaborator Author

@dfinity/consensus this is outside my area of expertise so please review carefully!

@basvandijk basvandijk marked this pull request as ready for review February 22, 2026 14:03
@basvandijk basvandijk requested a review from a team as a code owner February 22, 2026 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant