NO-ISSUE: monitor/haproxy: treat all-apiservers-down test as flake #31343
The haproxy all-apiservers-down monitor test was recently enabled to actually fail, but its detection logic is still being tuned. In the meantime it blocks payload acceptance for pre-existing conditions.

Return both a failure and a success junit when the test fails, which the CI system treats as a flake. This lets us track flake rates in Sippy without blocking nightly payloads while we refine the detection sensitivity.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
|
@mkowalski: This pull request references Jira Issue OCPBUGS-9209, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Walkthrough
When a full kube-apiserver outage is detected, the outage evaluator now emits both a failing JUnit case and a passing JUnit case. The corresponding test now checks the updated JUnit count and failure-output shape for success and failure paths.
Changes
Full API outage JUnit handling
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 14 passed, 1 failed (1 warning)
|
|
/approve |
|
@mkowalski: This pull request explicitly references no jira issue. DetailsIn response to this:
|
/verified later @mkowalski |
|
@mkowalski: This PR has been marked to be verified later by DetailsIn response to this:
|
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest_test.go`:
- Around line 267-277: The flake-marker assertions in the monitortest need to
match the downstream contract, not just the junit count and FailureOutput.
Update the checks around the junits returned in the failure path so the passing
junit is verified to have the same Name as the failing junit and a nil
SkipMessage, using the existing junits slice in monitortest_test.go to ensure
countRealFailures will classify it as a flake.
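The contract described in this comment can be sketched as a small check over the returned junits. This is only an illustration: the types below are simplified local stand-ins rather than the repository's real JUnit types, `checkFlakePair` is a hypothetical helper, and the ordering (failing case first, passing case second) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the JUnit types (assumed shape, not the real package).
type SkipMessage struct{ Message string }
type FailureOutput struct{ Output string }

type JUnitTestCase struct {
	Name          string
	SkipMessage   *SkipMessage
	FailureOutput *FailureOutput
}

// checkFlakePair is a hypothetical helper. It verifies that a failure path
// returned a failing junit followed by a passing junit with the same Name
// and no SkipMessage, which is what the review comment asks the test to assert.
func checkFlakePair(junits []*JUnitTestCase) error {
	if len(junits) != 2 {
		return fmt.Errorf("expected 2 junits, got %d", len(junits))
	}
	failing, passing := junits[0], junits[1] // assumed ordering
	if failing.FailureOutput == nil {
		return errors.New("first junit should carry a FailureOutput")
	}
	if passing.FailureOutput != nil {
		return errors.New("second junit should be a plain success")
	}
	if passing.Name != failing.Name {
		return fmt.Errorf("name mismatch: %q vs %q", passing.Name, failing.Name)
	}
	if passing.SkipMessage != nil {
		return errors.New("passing junit must not be marked as skipped")
	}
	return nil
}

func main() {
	name := "Haproxy should not encounter all kube apiservers down simultaneously"
	junits := []*JUnitTestCase{
		{Name: name, FailureOutput: &FailureOutput{Output: "all apiservers down"}},
		{Name: name},
	}
	fmt.Println(checkFlakePair(junits))
}
```

In the actual test file these checks would more likely be written as `t.Errorf` or `t.Fatalf` calls against the existing `junits` slice; returning an error here just keeps the sketch runnable on its own.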
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 00dad4f9-aa9e-478f-b4f8-0cf844a1840e
📒 Files selected for processing (2)
pkg/monitortests/network/onpremhaproxy/monitortest.go
pkg/monitortests/network/onpremhaproxy/monitortest_test.go
|
/retest-required |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski, petr-muller The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Scheduling required tests: |
|
/override ci/prow/e2e-vsphere-ovn |
|
@mkowalski: Overrode contexts on behalf of mkowalski: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-gcp-csi, ci/prow/e2e-vsphere-ovn, ci/prow/e2e-vsphere-ovn-upi DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mkowalski: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What
The haproxy `Haproxy should not encounter all kube apiservers down simultaneously` monitor test was recently enabled to actually fail (previously it always passed), but its detection logic is still being tuned. In the meantime it blocks payload acceptance for pre-existing conditions that need to be slowly improved.
Why
5.0 nightly payloads have been blocked by this test failing on metal-ipi and vsphere jobs. A sensitivity fix was merged in #31326 (tolerate install-phase bounces), but dgoodwin suggested we also add a flake fallback so the test can't hard-block payloads while we validate and iterate.
How
Return both a failure and a success junit when the test fails. The CI system treats a test that both fails and passes as a flake rather than a hard failure. This is the same pattern already used by the sibling test (`Haproxy must be able to reach kubeapi server`) in the same file.
This lets us track flake rates in Sippy without blocking nightly payloads while we refine the detection sensitivity.
Once the detection logic is stable and false positives are eliminated, we can remove the flake fallback and let the test hard-fail again.
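As a rough illustration of the flake pattern described above, the following sketch returns a failing and a passing junit under the same name when an outage is seen. The types are simplified local stand-ins, and `evaluateOutages` with its `[]string` input is a hypothetical shape, not the function in `monitortest.go`.

```go
package main

import "fmt"

// Simplified stand-ins for the JUnit types (assumed shape, not the real package).
type FailureOutput struct{ Output string }

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput
}

const testName = "Haproxy should not encounter all kube apiservers down simultaneously"

// evaluateOutages is a hypothetical evaluator. With no outages it returns a
// single passing junit. With outages it returns a failing junit AND a passing
// junit with the same name, so CI records a flake instead of a hard failure.
func evaluateOutages(outages []string) []*JUnitTestCase {
	pass := &JUnitTestCase{Name: testName}
	if len(outages) == 0 {
		return []*JUnitTestCase{pass}
	}
	fail := &JUnitTestCase{
		Name: testName,
		FailureOutput: &FailureOutput{
			Output: fmt.Sprintf("all kube-apiservers down %d time(s): %v", len(outages), outages),
		},
	}
	// Removing `pass` from this slice would turn the flake back into a hard failure.
	return []*JUnitTestCase{fail, pass}
}

func main() {
	for _, tc := range evaluateOutages([]string{"outage-1"}) {
		fmt.Printf("%s failed=%t\n", tc.Name, tc.FailureOutput != nil)
	}
}
```

Keeping the fallback to a single extra slice element means that dropping it later, once false positives are eliminated, is a one-line change.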
Context
Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1782395783960919
🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.