fix: deflake //rs/nns/integration_tests:integration_tests_src/node_provider_remuneration#8981
Conversation
…ovider_remuneration The test was flaky because governance's automated reward minting flow requires many inter-canister calls (NRC, CMC, Ledger, self-call) spanning 8+ ticks, but only 2 ticks were provided after each monthly metrics loop. Additionally, the retry loops used advance_time() without tick(), so governance's run_periodic_tasks never actually ran during retries. Root cause: - After feeding 31 days of daily node metrics in the inner loop, governance detects it's time to mint rewards during the final wait_for_nrc_metrics_sync ticks - The NRC may not have synced all needed days yet at that point, causing the mint to fail with "No metrics found for day X" - After wait_for_nrc_metrics_sync exits, only 2 explicit ticks were provided - insufficient for the full inter-canister call chain - The subsequent retry loops called advance_time() and query() but never tick(), so governance never got another chance to run periodic tasks and complete the minting Fix: - Add a wait_for_rewards_minting() helper that keeps calling tick() until governance successfully completes the full reward minting flow (detected by a change in the most_recent_monthly_node_provider_rewards timestamp) - Replace all insufficient 2-tick + non-ticking-retry patterns with calls to this helper - Remove the seconds_advanced_test_2 tracking that compensated for time advanced in the now-removed retry loops
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
No canister changes.
…vider_remuneration
Address review comment from jasonz-dfinity: add advance_time(1s) before each tick() in wait_for_rewards_minting so that timer-based periodic tasks are also triggered, making the test more robust and realistic.
|
@jasonz-dfinity since the PR is already 2 weeks old I asked Claude Opus 4.6 if it still addresses the flaky runs that happened in the last week: PR #8981 Flaky Test AnalysisFlaky Runs Summary (last week: 6 runs, 8 failing attempts)
Two Failure Patterns — Same Root CauseBoth patterns trace back to the Node Rewards Canister (NRC) not being synced when governance tries to mint rewards:
PR #8981's FixThe PR introduces Yes, PR #8981 still addresses both failure patterns:
One Minor ConcernThe PR removes the ConclusionThe PR is still relevant and should substantially reduce or eliminate the observed flakiness. The reviewer @jasonz-dfinity's suggestion to include |
The test
//rs/nns/integration_tests:integration_tests_src/node_provider_remunerationis slightly flaky:Claude Opus 4.6 made the follow Root Cause Analysis and accompanying fix:
Root Cause
The test was flaky because governance's automated reward minting flow requires many inter-canister calls (NRC, CMC, Ledger, self-call) spanning 8+ ticks, but only 2 ticks were provided after each monthly metrics loop. Additionally, the retry loops used
advance_time()withouttick(), so governance'srun_periodic_tasksnever actually ran during retries.Failure Patterns
Detailed Flow
wait_for_nrc_metrics_synctickswait_for_nrc_metrics_syncexits, only 2 explicit ticks were provided - insufficient for the full inter-canister call chainadvance_time()but nevertick(), so governance never ran periodic tasks during retriesFix
wait_for_rewards_minting()helper that keeps callingtick()until governance completes the full reward minting flowseconds_advanced_test_2tracking that compensated for the now-removed retry loopsSkill:
.claude/skills/fix-flaky-tests/SKILL.md.