NO-ISSUE: monitor/haproxy: treat all-apiservers-down test as flake #31343

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from mkowalski:fix/haproxy-all-down-flake-fallback
Jun 26, 2026
Conversation

@mkowalski

@mkowalski mkowalski commented Jun 26, 2026

Contributor

What

The haproxy monitor test "Haproxy should not encounter all kube apiservers down simultaneously" was recently enabled to actually fail (previously it always passed), but its detection logic is still being tuned. In the meantime, it blocks payload acceptance on pre-existing conditions that can only be improved gradually.

Why

5.0 nightly payloads have been blocked by this test failing on metal-ipi and vsphere jobs. A sensitivity fix was merged in #31326 (tolerate install-phase bounces), but dgoodwin suggested we also add a flake fallback so the test can't hard-block payloads while we validate and iterate.

How

Return both a failure and a success junit when the test fails. The CI system treats a test that both fails and passes as a flake rather than a hard failure. This is the same pattern already used by the sibling test, "Haproxy must be able to reach kubeapi server", in the same file.

This lets us:

  • Track flake rates in Sippy (Tests page → Flakes View)
  • Avoid blocking payload acceptance
  • Continue iterating on detection sensitivity

Once the detection logic is stable and false positives are eliminated, we can remove the flake fallback and let the test hard-fail again.
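
A minimal sketch of the flake-fallback pattern described above, not the code in this PR. The types JUnitTestCase and FailureOutput are simplified, hypothetical stand-ins for the real junit types, and evaluateOutages is an illustrative name rather than the actual evaluateFullAPIOutages signature. The test name string is copied from the description and may differ from the constant used in the file.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureOutput and JUnitTestCase are hypothetical, trimmed-down stand-ins
// for the junit types used by the monitor tests.
type FailureOutput struct {
	Output string
}

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput // nil means the case passed
}

// Assumed test name, taken from the PR description.
const fullOutageTestName = "Haproxy should not encounter all kube apiservers down simultaneously"

// evaluateOutages is a hypothetical evaluator. Each entry in outages
// describes one detected window in which every apiserver backend was down.
func evaluateOutages(outages []string) []*JUnitTestCase {
	if len(outages) == 0 {
		// Nothing detected: a single passing junit.
		return []*JUnitTestCase{{Name: fullOutageTestName}}
	}
	failure := &JUnitTestCase{
		Name:          fullOutageTestName,
		FailureOutput: &FailureOutput{Output: strings.Join(outages, "\n")},
	}
	// Flake fallback: a passing junit with the same name, so the pair is
	// reported as a flake instead of a hard failure.
	success := &JUnitTestCase{Name: fullOutageTestName}
	return []*JUnitTestCase{failure, success}
}

func main() {
	for _, tc := range evaluateOutages([]string{"example outage on node-0"}) {
		fmt.Printf("%q failed=%t\n", tc.Name, tc.FailureOutput != nil)
	}
}
```

Removing the fallback later would amount to returning only the failure junit from the failing branch.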

Context

Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1782395783960919


🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

  • Bug Fixes
    • Full API outage detection now reports both an alerting result and a non-blocking success marker, reducing the chance that transient outage signals block normal progress.
    • Improved result handling so passing cases are validated more strictly and outage cases are tracked more consistently.

The haproxy all-apiservers-down monitor test was recently enabled to
actually fail, but its detection logic is still being tuned. In the
meantime it blocks payload acceptance for pre-existing conditions.

Return both a failure and a success junit when the test fails, which
the CI system treats as a flake. This lets us track flake rates in
Sippy without blocking nightly payloads while we refine the detection
sensitivity.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
Signed-off-by: Mateusz Kowalski <mko@redhat.com>
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 26, 2026
@openshift-ci-robot


@mkowalski: This pull request references Jira Issue OCPBUGS-9209, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "5.0.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Done) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026


Walkthrough

When a full kube-apiserver outage is detected, the outage evaluator now emits both a failing JUnit case and a passing JUnit case. The corresponding test now checks the updated JUnit count and failure-output shape for success and failure paths.

Changes

Full API outage JUnit handling

Layer / File(s) | Summary
Emit failure and success JUnit (pkg/monitortests/network/onpremhaproxy/monitortest.go) | evaluateFullAPIOutages returns a failure JUnit plus a success JUnit for full outage detection.
Update outage test assertions (pkg/monitortests/network/onpremhaproxy/monitortest_test.go) | TestEvaluateFullAPIOutages checks one JUnit for passing cases and two JUnits for failing cases, including the flake-success marker.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • openshift/origin#31326: Also changes evaluateFullAPIOutages and TestEvaluateFullAPIOutages to adjust the JUnit results emitted for full kube-apiserver outage handling.

Suggested labels

verified-later

Suggested reviewers

  • p0lyn0mial
  • sjenning
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | No Ginkgo titles were added or changed; the only test names are static table-driven strings, and junit names remain a fixed constant.
Test Structure And Quality | ✅ Passed | The changed test is a pure table-driven unit test with clear assertion messages, no cluster setup/waits, and it follows the package’s existing testing style.
Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; this is monitor/unit-test logic, and the monitor already skips clusters without infrastructure config (e.g. MicroShift).
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; changes only adjust junit output in a monitor/unit-test helper and its unit tests, with no SNO topology assumptions.
Topology-Aware Scheduling Compatibility | ✅ Passed | Only monitor test/JUnit logic changed; no deployment, operator, controller, or topology-aware scheduling constraints were added.
Ote Binary Stdout Contract | ✅ Passed | No main/TestMain/init/RunSpecs or stdout-printing calls exist in the touched package; changes only adjust junit/test logic, not process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR only changes monitor-test JUnit handling and unit tests; no new Ginkgo e2e tests, IPv4 literals, or external/public connectivity assumptions found.
No-Weak-Crypto | ✅ Passed | The PR only changes JUnit return behavior and tests; the touched files contain no weak crypto primitives, custom crypto, or secret comparisons.
Container-Privileges | ✅ Passed | PR only changes Go monitor-test logic and tests; no container manifests or privilege settings (privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation) were added.
No-Sensitive-Data-In-Logs | ✅ Passed | No new logging was added; the PR only adjusts JUnit return values and test assertions, with outputs limited to test names, node names, and timestamps.
Title check | ✅ Passed | The title accurately summarizes the main change: treating the all-apiservers-down haproxy test as a flake.
Comment @coderabbitai help to get the list of available commands.

@mkowalski

Contributor Author

/approve

@mkowalski mkowalski changed the title OCPBUGS-9209: monitor/haproxy: treat all-apiservers-down test as flake NO-ISSUE: monitor/haproxy: treat all-apiservers-down test as flake Jun 26, 2026
@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 26, 2026
@openshift-ci-robot


@mkowalski: This pull request explicitly references no jira issue.

@mkowalski

Contributor Author

/verified later @mkowalski

@openshift-ci-robot openshift-ci-robot added verified-later verified Signifies that the PR passed pre-merge verification criteria labels Jun 26, 2026
@openshift-ci-robot


@mkowalski: This PR has been marked to be verified later by @mkowalski.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning June 26, 2026 08:49
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest_test.go`:
- Around line 267-277: The flake-marker assertions in the monitortest need to
match the downstream contract, not just the junit count and FailureOutput.
Update the checks around the junits returned in the failure path so the passing
junit is verified to have the same Name as the failing junit and a nil
SkipMessage, using the existing junits slice in monitortest_test.go to ensure
countRealFailures will classify it as a flake.
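
The contract in this review comment can be sketched as follows. This is an illustration under stated assumptions: the types are simplified stand-ins, isFlakeMarker is a hypothetical helper, and countHardFailures is a guess at how a downstream classifier might pair results by name. It is not the actual countRealFailures implementation, whose behavior is not shown in this thread.

```go
package main

import "fmt"

// Hypothetical, trimmed-down junit types for illustration only.
type FailureOutput struct{ Output string }
type SkipMessage struct{ Message string }

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput // nil means no failure recorded
	SkipMessage   *SkipMessage   // nil means the case was not skipped
}

// isFlakeMarker reports whether pass is a valid flake marker for fail:
// same Name, no failure output, and no skip message.
func isFlakeMarker(fail, pass *JUnitTestCase) bool {
	return fail.FailureOutput != nil &&
		pass.FailureOutput == nil &&
		pass.SkipMessage == nil &&
		pass.Name == fail.Name
}

// countHardFailures is an assumed classifier: a failed test name counts as
// a hard failure only if no unskipped passing junit shares that name.
func countHardFailures(junits []*JUnitTestCase) int {
	passed := map[string]bool{}
	for _, tc := range junits {
		if tc.FailureOutput == nil && tc.SkipMessage == nil {
			passed[tc.Name] = true
		}
	}
	hard := map[string]bool{}
	for _, tc := range junits {
		if tc.FailureOutput != nil && !passed[tc.Name] {
			hard[tc.Name] = true
		}
	}
	return len(hard)
}

func main() {
	fail := &JUnitTestCase{Name: "outage", FailureOutput: &FailureOutput{Output: "all down"}}
	pass := &JUnitTestCase{Name: "outage"}
	fmt.Println(isFlakeMarker(fail, pass), countHardFailures([]*JUnitTestCase{fail, pass}))
}
```

Under this assumed classifier, a passing junit with a different Name or a non-nil SkipMessage would not cancel the failure, which is why asserting on the count and FailureOutput alone leaves a gap.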

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 00dad4f9-aa9e-478f-b4f8-0cf844a1840e

📥 Commits

Reviewing files that changed from the base of the PR and between 817fa8a and dd88e86.

📒 Files selected for processing (2)
  • pkg/monitortests/network/onpremhaproxy/monitortest.go
  • pkg/monitortests/network/onpremhaproxy/monitortest_test.go

Comment thread pkg/monitortests/network/onpremhaproxy/monitortest_test.go
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2026
@petr-muller

Member

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [mkowalski,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot

Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 26, 2026
@mkowalski

Contributor Author

/override ci/prow/e2e-vsphere-ovn
/override ci/prow/e2e-vsphere-ovn-upi
/override ci/prow/e2e-aws-ovn-fips
/override ci/prow/e2e-gcp-csi

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@mkowalski: Overrode contexts on behalf of mkowalski: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-gcp-csi, ci/prow/e2e-vsphere-ovn, ci/prow/e2e-vsphere-ovn-upi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@mkowalski: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot Bot merged commit 1f971e0 into openshift:main Jun 26, 2026
21 checks passed