fix: deflake //rs/sns/integration_tests:upgrade_canister_test#8993
fix: deflake //rs/sns/integration_tests:upgrade_canister_test#8993jasonz-dfinity merged 1 commit intomasterfrom
Conversation
Replace wall-clock-based age assertions with timestamp consistency checks using the proposal's own internal timestamps. Root cause: The test_upgrade_canister_proposal_execution_fail test used SystemTime::now() (wall clock) to compute how long ago the proposal was decided/failed. However, the state machine's internal time (used for proposal timestamps) is initialized at test start and does not advance during the SNS setup phase, which compiles WASM binaries and takes 60-100+ seconds of wall-clock time. This caused decision_age_s to equal the variable setup time rather than the actual proposal processing time, exceeding the 60-second threshold. Fix: Compare decided_timestamp_seconds and failed_timestamp_seconds against proposal_creation_timestamp_seconds (all in the same state machine time domain), eliminating the wall-clock dependency entirely. Skill: .claude/skills/fix-flaky-tests/SKILL.md
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
No canister changes.
//rs/sns/integration_tests:upgrade_canister_testwas flaky this week:The test failed 3 times in a similar way:
Specifically the
test_upgrade_canister_proposal_execution_failfailed 3 times on the following assert:where
EXPECTED_SNS_DAPP_CANISTER_UPGRADE_TIME_SECONDS = 60with an error like:Claude Opus 4.6 made the following Root Cause Analysis and accompanying fix:
Root Cause
The
test_upgrade_canister_proposal_execution_failtest usedSystemTime::now()(wall clock) to compute how long ago the proposal was decided/failed. However, the state machine's internal time (used for proposal timestamps) is initialized at test start viawith_current_time()and does not advance during the SNS setup phase, which compiles WASM binaries and takes 60-100+ seconds of wall-clock time.This caused
decision_age_sto equal the variable setup time rather than the actual proposal processing time, exceeding the 60-second threshold (EXPECTED_SNS_DAPP_CANISTER_UPGRADE_TIME_SECONDS).From flaky run logs, the observed values were:
decision_age_s: 103(setup took 101.3s)decision_age_s: 69(setup took 67.1s)decision_age_s: 71(setup took 69.1s)Fix
Replace the wall-clock-based
age_s()function with timestamp consistency checks that comparedecided_timestamp_secondsandfailed_timestamp_secondsagainstproposal_creation_timestamp_seconds. All three timestamps live in the same state machine time domain, eliminating the wall-clock dependency entirely.Skill:
.claude/skills/fix-flaky-tests/SKILL.md.