OCPBUGS-64729: Update etcd alerts to match observed real world data #1511

dgoodwin · 2025-11-06T18:17:21Z

Builds on #1495; the justification is in this comment: https://issues.redhat.com/browse/OCPBUGS-64729?focusedId=28407511&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28407511

coderabbitai · 2025-11-06T18:18:32Z

Walkthrough

etcd alert rules were changed: commit-duration alert expr threshold lowered (0.5 → 0.08) and a new critical commit alert (>0.10) added; fsync alerts were replaced by new warning and critical rules (0.05 and 0.07) and old fsync alerts removed; jsonnet config and dependency lock updated; main.jsonnet excludes an alert name.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Jsonnet alert definitions `jsonnet/custom.libsonnet` | Modified existing `etcdHighCommitDurations` expr threshold (0.5 → 0.08); added new `etcdHighCommitDurations` variant (>0.10, critical); added `etcdHighFsyncDurations` rules (warning >0.05 and critical >0.07) with annotations and runbook URLs. |
| Jsonnet entrypoint / exclusions `jsonnet/main.jsonnet` | Added `etctdHighFsyncDurations` to the `excludedAlerts` list (note: name appears to contain a typo). |
| Dependency lock `jsonnet/jsonnetfile.lock.json` | Bumped etcd mixin dependency version/hash from `e4d6a05...` to `c9184ab...` (sum unchanged). |
| PrometheusRule manifest `manifests/0000_90_etcd-operator_03_prometheusrule.yaml` | Removed two old `etcdHighFsyncDurations` alerts; updated `etcdHighCommitDurations` threshold (0.5 → 0.08); added new critical commit alert (>0.10) and fsync alerts (warning >0.05, critical >0.07); added runbook_url to commit alert. |
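
The thresholds summarized in the table can be cross-checked with a small script. The sketch below is a minimal Python illustration, not part of the PR: the `THRESHOLDS` dictionary only restates the values from the summary (in seconds), and the helper name `check_tiers` is hypothetical. It verifies that each alert's warning threshold sits below its critical threshold.

```python
# Threshold values (seconds) restated from the change summary above.
# The dictionary layout and function name are illustrative assumptions.
THRESHOLDS = {
    "etcdHighCommitDurations": {"warning": 0.08, "critical": 0.10},
    "etcdHighFsyncDurations": {"warning": 0.05, "critical": 0.07},
}


def check_tiers(thresholds):
    """Return the alert names whose warning threshold is not below critical."""
    bad = []
    for alert, tiers in thresholds.items():
        if not tiers["warning"] < tiers["critical"]:
            bad.append(alert)
    return bad


if __name__ == "__main__":
    problems = check_tiers(THRESHOLDS)
    print("misordered alerts:", problems)
```

A fuller check would read the values from the rendered manifest rather than from a hand-written dictionary, so that the Jsonnet source and the generated YAML cannot drift apart unnoticed.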

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Verify threshold alignment across custom.libsonnet and the generated manifest (ensure intended values: 0.08, 0.10, 0.05, 0.07).
Confirm the exclusion entry etctdHighFsyncDurations in main.jsonnet is not a typo for etcdHighFsyncDurations.
Check for unintended duplicate or overlapping alerts after replacing/removing old fsync rules.
Review the etcd mixin version bump for any collateral changes to alert naming/labels that may affect filtering.
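
The second checklist item, a possible misspelling in the exclusion list, is the kind of slip a near-match comparison can catch. The following Python sketch is only an illustration under stated assumptions: the two name lists are hand-copied from this review rather than parsed from `main.jsonnet` or `custom.libsonnet`, and `find_suspect_exclusions` is a hypothetical helper. It uses the standard library's `difflib.get_close_matches` to flag excluded names that match no known alert but closely resemble one.

```python
import difflib

# Names hand-copied from the review summary; a real check would parse the
# Jsonnet sources instead (assumption for illustration).
KNOWN_ALERTS = ["etcdHighCommitDurations", "etcdHighFsyncDurations"]
EXCLUDED_ALERTS = ["etctdHighFsyncDurations"]


def find_suspect_exclusions(excluded, known, cutoff=0.8):
    """Map each excluded name that matches no known alert to its nearest known name."""
    suspects = {}
    for name in excluded:
        if name in known:
            continue  # exact match: the exclusion refers to a real alert
        close = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
        if close:
            suspects[name] = close[0]
    return suspects


if __name__ == "__main__":
    found = find_suspect_exclusions(EXCLUDED_ALERTS, KNOWN_ALERTS)
    for name, suggestion in found.items():
        print(f"{name!r} matches no known alert; did you mean {suggestion!r}?")
```

An exclusion that matches nothing is silently ignored when the rules are rendered, which is why a check of this kind is more useful than relying on the build to fail.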

openshift-ci-robot · 2025-11-06T18:18:36Z

@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is invalid:

expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

NO-JIRA: add recommended etcd threshold for alerts

Update values for observed fleet and CI data

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

dgoodwin · 2025-11-06T18:24:11Z

/jira refresh

openshift-ci-robot · 2025-11-06T18:24:23Z

@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

simonpasquier · 2025-11-07T08:48:30Z

manifests/0000_90_etcd-operator_03_prometheusrule.yaml

-        > 0.5
-      for: 10m
-      labels:
-        severity: warning


IIUC we're removing the warning severity for etcdHighFsyncDurations. Do we have another rule that can notify platform admins before the critical alert fires?

This one is a little tough: we don't actually know at what point a cluster falls over, only that the upstream recommendations are optimistic and that we have thousands of clusters running with much higher values. I'm estimating how much chaos we're willing to cause to bring these thresholds back down to sensible levels while keeping to the 5% alerting rate. I could trim some off the recommendations here and call that a warning threshold, but then we might exceed our 5% fleet rate.

While we don't provide stability guarantees for alerting rules, I presume that some cluster admins will be puzzled by the removal of the warning severity. As stated in https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#warning-alerts, warning alerts don't require immediate action, but they help identify potential issues. We could use a higher `for` value to keep the alerting rule from triggering too often.

cc @typeid

OK, how about I use the limits currently in the PR for warning, and add a critical level a little higher?
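
The trade-off in this thread, picking a threshold that keeps the share of alerting clusters near a 5% budget, can be sketched numerically. The Python below is a hypothetical illustration only: the sample latencies are made up rather than fleet data, and the function names are assumptions. It computes the fraction of clusters whose p99 value exceeds a candidate threshold and compares that fraction with the budget.

```python
def alerting_fraction(p99_seconds, threshold):
    """Fraction of clusters whose p99 latency exceeds the threshold."""
    if not p99_seconds:
        return 0.0
    firing = sum(1 for value in p99_seconds if value > threshold)
    return firing / len(p99_seconds)


def within_budget(p99_seconds, threshold, budget=0.05):
    """True if the share of alerting clusters stays at or below the budget."""
    return alerting_fraction(p99_seconds, threshold) <= budget


if __name__ == "__main__":
    # Made-up per-cluster p99 fsync latencies in seconds (not real fleet data).
    sample = [0.004, 0.006, 0.009, 0.012, 0.015, 0.018, 0.02, 0.025, 0.03, 0.09]
    for threshold in (0.02, 0.05):
        share = alerting_fraction(sample, threshold)
        ok = within_budget(sample, threshold)
        print(f"threshold {threshold}s -> {share:.0%} alerting, within budget: {ok}")
```

Lowering a threshold can only keep or raise the alerting fraction, which is the tension described above: a warning tier set below the recommended values risks pushing the fleet past the 5% rate.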

dgoodwin · 2025-11-07T13:19:18Z

Updated with an attempt at warning plus critical levels.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

jsonnet/custom.libsonnet (1)
63-85: Add the runbook URL in the Jsonnet source as well

The generated manifest now exposes runbook_url for both warning/critical etcdHighCommitDurations, but the Jsonnet definition still omits it. Any other consumers rendering from custom.libsonnet will miss the runbook link, leading to divergence between bundles. Please add the same runbook_url entry to both alert blocks so downstream renders stay in sync.
           annotations: {
             description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
             summary: 'etcd cluster 99th percentile commit durations are too high.',
+            runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
           },
         },
@@
           annotations: {
             description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.',
             summary: 'etcd cluster 99th percentile commit durations are too high.',
+            runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md'
           },

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1d496e6 and 90ac16f.

📒 Files selected for processing (3)

jsonnet/custom.libsonnet (2 hunks)
jsonnet/jsonnetfile.lock.json (1 hunks)
manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

jsonnet/jsonnetfile.lock.json

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

jsonnet/custom.libsonnet
manifests/0000_90_etcd-operator_03_prometheusrule.yaml

tjungblu · 2025-11-11T14:16:50Z

@hasbro17 / @dgoodwin / @simonpasquier I've created an upstream PR for this and all the other stuff we have accumulated over the years in:
etcd-io/etcd#20917

hasbro17 · 2025-12-03T06:42:14Z

/lgtm
/retest-required

openshift-ci · 2025-12-03T06:42:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [hasbro17]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-12-03T09:47:50Z

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/configmap-scale | `90ac16f` | link | false | `/test configmap-scale` |
| ci/prow/e2e-aws-ovn-serial-2of2 | `90ac16f` | link | true | `/test e2e-aws-ovn-serial-2of2` |
| ci/prow/e2e-aws-ovn-serial-1of2 | `90ac16f` | link | true | `/test e2e-aws-ovn-serial-1of2` |
| ci/prow/e2e-aws-ovn-single-node | `90ac16f` | link | true | `/test e2e-aws-ovn-single-node` |
| ci/prow/e2e-agnostic-ovn-upgrade | `90ac16f` | link | true | `/test e2e-agnostic-ovn-upgrade` |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | `90ac16f` | link | true | `/test e2e-metal-ipi-ovn-ipv6` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2025-12-12T08:26:59Z

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

openshift-ci-robot · 2025-12-12T08:28:23Z

@openshift-bot: This pull request references Jira Issue OCPBUGS-64729, which is invalid:

expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.21.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

tjungblu and others added 2 commits October 22, 2025 15:09

NO-JIRA: add recommended etcd threshold for alerts

41821e3

Update values for observed fleet and CI data

1d496e6

openshift-ci bot requested review from dusk125 and ironcladlou November 6, 2025 18:17

dgoodwin changed the title ~~etcd fleet ci thresholds~~ OCPBUGS-64729: Update etcd alerts to match observed real world data Nov 6, 2025

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 6, 2025

dgoodwin mentioned this pull request Nov 6, 2025

NO-JIRA: add recommended etcd threshold for alerts #1495

Open

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 6, 2025

simonpasquier reviewed Nov 7, 2025

Reinstate warning levels for commit duration and wal fsync alerts

90ac16f

coderabbitai bot reviewed Nov 7, 2025

openshift-ci bot assigned hasbro17 Dec 3, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 3, 2025

openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Dec 12, 2025
