@dgoodwin
Contributor

@dgoodwin dgoodwin commented Nov 6, 2025

@openshift-ci openshift-ci bot requested review from dusk125 and ironcladlou November 6, 2025 18:17
@dgoodwin dgoodwin changed the title from "etcd fleet ci thresholds" to "OCPBUGS-64729: Update etcd alerts to match observed real world data" Nov 6, 2025
@coderabbitai

coderabbitai bot commented Nov 6, 2025

Walkthrough

etcd alert rules were changed: commit-duration alert expr threshold lowered (0.5 → 0.08) and a new critical commit alert (>0.10) added; fsync alerts were replaced by new warning and critical rules (0.05 and 0.07) and old fsync alerts removed; jsonnet config and dependency lock updated; main.jsonnet excludes an alert name.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Jsonnet alert definitions: jsonnet/custom.libsonnet | Modified existing etcdHighCommitDurations expr threshold (0.5 → 0.08); added new etcdHighCommitDurations variant (>0.10, critical); added etcdHighFsyncDurations rules (warning >0.05 and critical >0.07) with annotations and runbook URLs. |
| Jsonnet entrypoint / exclusions: jsonnet/main.jsonnet | Added etctdHighFsyncDurations to the excludedAlerts list (note: name appears to contain a typo). |
| Dependency lock: jsonnet/jsonnetfile.lock.json | Bumped etcd mixin dependency version/hash from e4d6a05... to c9184ab... (sum unchanged). |
| PrometheusRule manifest: manifests/0000_90_etcd-operator_03_prometheusrule.yaml | Removed two old etcdHighFsyncDurations alerts; updated etcdHighCommitDurations threshold (0.5 → 0.08); added new critical commit alert (>0.10) and fsync alerts (warning >0.05, critical >0.07); added runbook_url to commit alert. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify threshold alignment across custom.libsonnet and the generated manifest (ensure intended values: 0.08, 0.10, 0.05, 0.07).
  • Confirm the exclusion entry etctdHighFsyncDurations in main.jsonnet is not a typo for etcdHighFsyncDurations.
  • Check for unintended duplicate or overlapping alerts after replacing/removing old fsync rules.
  • Review the etcd mixin version bump for any collateral changes to alert naming/labels that may affect filtering.
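
The second and third checks above can be illustrated with a small Jsonnet sketch. It is not the repository's actual main.jsonnet: `upstreamRules`, its placeholder expressions, and the filtering logic are hypothetical stand-ins, assumed only to show how a misspelled entry in an exclusion list would behave if the list is applied by exact name match.

```jsonnet
// Hypothetical stand-ins for the upstream mixin rules; the real rule set
// and the real main.jsonnet filtering code may look different.
local upstreamRules = [
  { alert: 'etcdHighFsyncDurations', expr: 'placeholder_fsync_expr' },
  { alert: 'etcdHighCommitDurations', expr: 'placeholder_commit_expr' },
];

// Exclusion entry spelled exactly as reported in the summary above.
local excludedAlerts = ['etctdHighFsyncDurations'];

// Keep only rules whose alert name is not in the exclusion list.
local kept = std.filter(
  function(rule) !std.member(excludedAlerts, rule.alert),
  upstreamRules
);

// Exclusion entries that do not match any known alert name.
local known = [rule.alert for rule in upstreamRules];
local unmatched = [name for name in excludedAlerts if !std.member(known, name)];

{
  keptAlerts: [rule.alert for rule in kept],
  unmatchedExclusions: unmatched,
}
```

Under these assumptions both upstream alerts stay in `keptAlerts` and the misspelled name shows up in `unmatchedExclusions`, which is the overlap scenario the duplicate-alert check is meant to catch. Turning the `unmatched` list into an `assert std.length(unmatched) == 0` would make such a typo fail at render time.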

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Nov 6, 2025
@openshift-ci-robot

@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

  • NO-JIRA: add recommended etcd threshold for alerts
  • Update values for observed fleet and CI data

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dgoodwin
Contributor Author

dgoodwin commented Nov 6, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) and removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) Nov 6, 2025
@openshift-ci-robot

@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

  > 0.5
for: 10m
labels:
  severity: warning
Contributor

IIUC we're removing the warning severity for etcdHighFsyncDurations. Do we have another rule which can notify platform admins before the critical alert first?

Contributor Author

It's a little tough for this: we don't actually know when a cluster falls over, just that upstream recommendations are optimistic and we have thousands of clusters running much higher. I'm estimating what level of chaos we're willing to cause to bring these back down to sensible levels with the 5% alerting rate. I could trim some off the recommendations here and call that a warning threshold, but then we might be over our 5% fleet rate.

Contributor

While we don't provide stability guarantees for alerting rules, I presume that some cluster admins will be puzzled by the removal of the warning severity. As stated in https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#warning-alerts, warning alerts don't require immediate action, but they help identify potential issues. We could use a higher "for" value to keep the alerting rule from triggering too often.

cc @typeid

Contributor Author

OK, how about I use the limits currently in the PR for warning, and add a critical level a little higher?

@dgoodwin
Contributor Author

dgoodwin commented Nov 7, 2025

Updated with an attempt at warning plus critical levels.
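
For readers following along, a warning plus critical pairing of this kind might look roughly like the Jsonnet sketch below. The four thresholds (0.08 and 0.10 for commit, 0.05 and 0.07 for fsync) are the ones quoted in the summary earlier in this conversation. The helper name `p99Alert`, the job selector, the 5m rate window, and the 10m "for" duration are assumptions made for illustration and may not match custom.libsonnet.

```jsonnet
// Sketch only. Thresholds come from the summary in this conversation;
// the job selector, rate window, and `for` duration are assumptions.
local p99Alert(name, metric, threshold, severity) = {
  alert: name,
  expr: |||
    histogram_quantile(0.99, rate(%(metric)s{job=~".*etcd.*"}[5m]))
    > %(threshold)s
  ||| % { metric: metric, threshold: threshold },
  'for': '10m',
  labels: { severity: severity },
};

// Histogram metrics exposed by etcd for backend commit and WAL fsync latency.
local commit = 'etcd_disk_backend_commit_duration_seconds_bucket';
local fsync = 'etcd_disk_wal_fsync_duration_seconds_bucket';

{
  rules: [
    p99Alert('etcdHighCommitDurations', commit, 0.08, 'warning'),
    p99Alert('etcdHighCommitDurations', commit, 0.10, 'critical'),
    p99Alert('etcdHighFsyncDurations', fsync, 0.05, 'warning'),
    p99Alert('etcdHighFsyncDurations', fsync, 0.07, 'critical'),
  ],
}
```

Keeping each warning threshold below its critical counterpart means the warning fires first as latency climbs, which is the early notice asked for in the review thread above.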

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
jsonnet/custom.libsonnet (1)

63-85: Add the runbook URL in the Jsonnet source as well

The generated manifest now exposes runbook_url for both warning/critical etcdHighCommitDurations, but the Jsonnet definition still omits it. Any other consumers rendering from custom.libsonnet will miss the runbook link, leading to divergence between bundles. Please add the same runbook_url entry to both alert blocks so downstream renders stay in sync.

           annotations: {
             description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
             summary: 'etcd cluster 99th percentile commit durations are too high.',
+            runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
           },
         },
@@
           annotations: {
             description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
             summary: 'etcd cluster 99th percentile commit durations are too high.',
+            runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
           },
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1d496e6 and 90ac16f.

📒 Files selected for processing (3)
  • jsonnet/custom.libsonnet (2 hunks)
  • jsonnet/jsonnetfile.lock.json (1 hunks)
  • manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jsonnet/jsonnetfile.lock.json
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • jsonnet/custom.libsonnet
  • manifests/0000_90_etcd-operator_03_prometheusrule.yaml

@tjungblu
Contributor

@hasbro17 / @dgoodwin / @simonpasquier I've created an upstream PR for this and all the other stuff we have accumulated over the years in:
etcd-io/etcd#20917

@hasbro17
Contributor

hasbro17 commented Dec 3, 2025

/lgtm
/retest-required

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Dec 3, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Dec 3, 2025
@openshift-ci
Contributor

openshift-ci bot commented Dec 3, 2025

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/configmap-scale | 90ac16f | link | false | /test configmap-scale |
| ci/prow/e2e-aws-ovn-serial-2of2 | 90ac16f | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/e2e-aws-ovn-serial-1of2 | 90ac16f | link | true | /test e2e-aws-ovn-serial-1of2 |
| ci/prow/e2e-aws-ovn-single-node | 90ac16f | link | true | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-agnostic-ovn-upgrade | 90ac16f | link | true | /test e2e-agnostic-ovn-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 90ac16f | link | true | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) and removed the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Dec 12, 2025
@openshift-ci-robot

@openshift-bot: This pull request references Jira Issue OCPBUGS-64729, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants