NO-ISSUE: monitor/haproxy: treat all-apiservers-down test as flake by mkowalski · Pull Request #31343 · openshift/origin

mkowalski · 2026-06-26T08:45:19Z

What

The haproxy monitor test `Haproxy should not encounter all kube apiservers down simultaneously` was recently enabled to actually fail (previously it always passed), but its detection logic is still being tuned. In the meantime, it blocks payload acceptance for pre-existing conditions that need to be slowly improved.

Why

5.0 nightly payloads have been blocked by this test failing on metal-ipi and vsphere jobs. A sensitivity fix was merged in #31326 (tolerate install-phase bounces), but dgoodwin suggested we also add a flake fallback so the test can't hard-block payloads while we validate and iterate.

How

Return both a failure and a success junit when the test fails. The CI system treats a test that both fails and passes as a flake rather than a hard failure. This is the same pattern already used by the sibling test (`Haproxy must be able to reach kubeapi server`) in the same file.
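A minimal sketch of the failure-plus-success pattern described above. The types `JUnitTestCase` and `FailureOutput` are simplified stand-ins defined here for illustration, and `evaluateOutages` with its `[]string` input is a hypothetical evaluator, not the actual `evaluateFullAPIOutages` signature from this PR.

```go
package main

import "fmt"

// FailureOutput and JUnitTestCase are simplified stand-ins for the junit
// types used by the monitor tests; the real definitions may differ.
type FailureOutput struct {
	Output string
}

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput
}

const testName = "Haproxy should not encounter all kube apiservers down simultaneously"

// evaluateOutages is a hypothetical evaluator. Each entry in outages is a
// human-readable description of one detected full outage.
func evaluateOutages(outages []string) []*JUnitTestCase {
	if len(outages) == 0 {
		// No outage detected: a single passing junit.
		return []*JUnitTestCase{{Name: testName}}
	}
	failure := &JUnitTestCase{
		Name: testName,
		FailureOutput: &FailureOutput{
			Output: fmt.Sprintf("%d full outage(s): %v", len(outages), outages),
		},
	}
	// A passing junit with the same name turns the failure into a flake.
	success := &JUnitTestCase{Name: testName}
	return []*JUnitTestCase{failure, success}
}

func main() {
	for _, tc := range evaluateOutages([]string{"all backends down on master-0"}) {
		fmt.Printf("%s failed=%t\n", tc.Name, tc.FailureOutput != nil)
	}
}
```

Removing the flake fallback later would amount to returning only the failing junit in the outage branch.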

This lets us:

- Track flake rates in Sippy (Tests page → Flakes View)
- Avoid blocking payload acceptance
- Continue iterating on detection sensitivity

Once the detection logic is stable and false positives are eliminated, we can remove the flake fallback and let the test hard-fail again.

Context

Slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1782395783960919

🤖 This PR was generated by AI on behalf of @mkowalski, who has reviewed it.

Summary by CodeRabbit

Bug Fixes
- Full API outage detection now reports both an alerting result and a non-blocking success marker, reducing the chance that transient outage signals block normal progress.
- Improved result handling so passing cases are validated more strictly and outage cases are tracked more consistently.

The haproxy all-apiservers-down monitor test was recently enabled to actually fail, but its detection logic is still being tuned. In the meantime it blocks payload acceptance for pre-existing conditions.

Return both a failure and a success junit when the test fails, which the CI system treats as a flake. This lets us track flake rates in Sippy without blocking nightly payloads while we refine the detection sensitivity.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Generated-by: AI
Signed-off-by: Mateusz Kowalski <mko@redhat.com>

openshift-merge-bot · 2026-06-26T08:45:23Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot · 2026-06-26T08:45:26Z

@mkowalski: This pull request references Jira Issue OCPBUGS-9209, which is invalid:

expected the bug to be open, but it isn't
expected the bug to target the "5.0.0" version, but no target version was set
expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Done) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

coderabbitai · 2026-06-26T08:46:10Z

Walkthrough

When a full kube-apiserver outage is detected, the outage evaluator now emits both a failing JUnit case and a passing JUnit case. The corresponding test now checks the updated JUnit count and failure-output shape for success and failure paths.

Changes

Full API outage JUnit handling

| Layer / File(s) | Summary |
| --- | --- |
| Emit failure and success JUnit `pkg/monitortests/network/onpremhaproxy/monitortest.go` | `evaluateFullAPIOutages` returns a failure JUnit plus a success JUnit for full outage detection. |
| Update outage test assertions `pkg/monitortests/network/onpremhaproxy/monitortest_test.go` | `TestEvaluateFullAPIOutages` checks one JUnit for passing cases and two JUnits for failing cases, including the flake-success marker. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

openshift/origin#31326: Also changes evaluateFullAPIOutages and TestEvaluateFullAPIOutages to adjust the JUnit results emitted for full kube-apiserver outage handling.

Suggested labels

verified-later

Suggested reviewers

p0lyn0mial
sjenning

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo titles were added or changed; the only test names are static table-driven strings, and junit names remain a fixed constant. |
| Test Structure And Quality | ✅ Passed | The changed test is a pure table-driven unit test with clear assertion messages, no cluster setup/waits, and it follows the package’s existing testing style. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; this is monitor/unit-test logic, and the monitor already skips clusters without infrastructure config (e.g. MicroShift). |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; changes only adjust junit output in a monitor/unit-test helper and its unit tests, with no SNO topology assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only monitor test/JUnit logic changed; no deployment, operator, controller, or topology-aware scheduling constraints were added. |
| Ote Binary Stdout Contract | ✅ Passed | No main/TestMain/init/RunSpecs or stdout-printing calls exist in the touched package; changes only adjust junit/test logic, not process-level stdout writes. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR only changes monitor-test JUnit handling and unit tests; no new Ginkgo e2e tests, IPv4 literals, or external/public connectivity assumptions found. |
| No-Weak-Crypto | ✅ Passed | The PR only changes JUnit return behavior and tests; the touched files contain no weak crypto primitives, custom crypto, or secret comparisons. |
| Container-Privileges | ✅ Passed | PR only changes Go monitor-test logic and tests; no container manifests or privilege settings (privileged, hostPID/Network/IPC, SYS_ADMIN, allowPrivilegeEscalation) were added. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No new logging was added; the PR only adjusts JUnit return values and test assertions, with outputs limited to test names, node names, and timestamps. |
| Title check | ✅ Passed | The title accurately summarizes the main change: treating the all-apiservers-down haproxy test as a flake. |

mkowalski · 2026-06-26T08:47:35Z

/approve

openshift-ci-robot · 2026-06-26T08:48:04Z

@mkowalski: This pull request explicitly references no jira issue.

mkowalski · 2026-06-26T08:48:43Z

/verified later @mkowalski

openshift-ci-robot · 2026-06-26T08:48:58Z

@mkowalski: This PR has been marked to be verified later by @mkowalski.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest_test.go`:
- Around line 267-277: The flake-marker assertions in the monitortest need to
match the downstream contract, not just the junit count and FailureOutput.
Update the checks around the junits returned in the failure path so the passing
junit is verified to have the same Name as the failing junit and a nil
SkipMessage, using the existing junits slice in monitortest_test.go to ensure
countRealFailures will classify it as a flake.
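As a rough illustration of why the matching `Name` and nil `SkipMessage` matter, here is a sketch of a flake classifier. The function `countHardFailures` and the types below are hypothetical stand-ins written for this example; the actual `countRealFailures` implementation is not shown in this thread and may behave differently.

```go
package main

import "fmt"

// Simplified stand-ins for the junit types; the real definitions may differ.
type FailureOutput struct{ Output string }
type SkipMessage struct{ Message string }

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput
	SkipMessage   *SkipMessage
}

// countHardFailures is a hypothetical classifier: a failed test name counts
// as a hard failure only if no non-skipped passing junit shares that name.
func countHardFailures(junits []*JUnitTestCase) int {
	passed := map[string]bool{}
	failed := map[string]bool{}
	for _, tc := range junits {
		switch {
		case tc.SkipMessage != nil:
			// Skipped results count as neither a pass nor a failure.
		case tc.FailureOutput != nil:
			failed[tc.Name] = true
		default:
			passed[tc.Name] = true
		}
	}
	hard := 0
	for name := range failed {
		if !passed[name] {
			hard++
		}
	}
	return hard
}

func main() {
	flake := []*JUnitTestCase{
		{Name: "t", FailureOutput: &FailureOutput{Output: "outage"}},
		{Name: "t"},
	}
	mismatched := []*JUnitTestCase{
		{Name: "t", FailureOutput: &FailureOutput{Output: "outage"}},
		{Name: "other"},
	}
	fmt.Println(countHardFailures(flake), countHardFailures(mismatched))
}
```

Under these assumptions, a success marker with a different name or a non-nil skip message would leave the failure classified as hard, which is the case the suggested assertions guard against.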

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 00dad4f9-aa9e-478f-b4f8-0cf844a1840e

📥 Commits

Reviewing files that changed from the base of the PR and between 817fa8a and dd88e86.

📒 Files selected for processing (2)

pkg/monitortests/network/onpremhaproxy/monitortest.go
pkg/monitortests/network/onpremhaproxy/monitortest_test.go

petr-muller · 2026-06-26T10:16:33Z

/retest-required

openshift-ci · 2026-06-26T10:16:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski, petr-muller

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mkowalski,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-merge-bot · 2026-06-26T10:47:31Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

mkowalski · 2026-06-26T14:52:32Z

/override ci/prow/e2e-vsphere-ovn
/override ci/prow/e2e-vsphere-ovn-upi
/override ci/prow/e2e-aws-ovn-fips
/override ci/prow/e2e-gcp-csi

openshift-ci · 2026-06-26T14:52:48Z

@mkowalski: Overrode contexts on behalf of mkowalski: ci/prow/e2e-aws-ovn-fips, ci/prow/e2e-gcp-csi, ci/prow/e2e-vsphere-ovn, ci/prow/e2e-vsphere-ovn-upi

openshift-ci · 2026-06-26T14:52:57Z

@mkowalski: all tests passed!

openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Jun 26, 2026

mkowalski changed the title ~~OCPBUGS-9209: monitor/haproxy: treat all-apiservers-down test as flake~~ NO-ISSUE: monitor/haproxy: treat all-apiservers-down test as flake Jun 26, 2026

openshift-ci-robot removed the jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) label Jun 26, 2026

openshift-ci-robot added the verified-later and verified (Signifies that the PR passed pre-merge verification criteria) labels Jun 26, 2026

openshift-ci Bot requested review from deads2k and sjenning June 26, 2026 08:49

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 26, 2026

coderabbitai Bot requested changes Jun 26, 2026

Comment thread pkg/monitortests/network/onpremhaproxy/monitortest_test.go

petr-muller approved these changes Jun 26, 2026

openshift-ci Bot assigned petr-muller Jun 26, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 26, 2026

coderabbitai Bot approved these changes Jun 26, 2026

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 26, 2026

openshift-merge-bot Bot merged commit 1f971e0 into openshift:main Jun 26, 2026
21 checks passed
