-
Notifications
You must be signed in to change notification settings - Fork 148
OCPBUGS-64729: Update etcd alerts to match observed real world data #1511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Walkthroughetcd alert rules were changed: commit-duration alert expr threshold lowered (0.5 → 0.08) and a new critical commit alert (>0.10) added; fsync alerts were replaced by new warning and critical rules (0.05 and 0.07) and old fsync alerts removed; jsonnet config and dependency lock updated; main.jsonnet excludes an alert name. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
|
@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@dgoodwin: This pull request references Jira Issue OCPBUGS-64729, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| > 0.5 | ||
| for: 10m | ||
| labels: | ||
| severity: warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC we're removing the warning severity for etcdHighFsyncDurations. Do we have another rule which can notify platform admins before the critical alert first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little tough for this, we don't actually know when a cluster falls over. Just that upstream recommendations are optimistic and we have thousands of clusters running much higher. I'm estimating what level of chaos we're willing to cause to lower these down to sensible levels again with the 5% alerting rate, I could trim some off the recommendations here and call that a warning threshold, but then we might be over our 5% fleet rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we don't provide stability guarantees for alerting rules, I presume that some cluster admins will be puzzled by the removal of the warning severity. As stated in https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#warning-alerts warning alerts don't require immediate action but they help identifying potential issues. We could use a higher for value to avoid the alerting rule triggering too often.
cc @typeid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok how about I use these limits currently in the pr for warning, and add a critical level a little higher
|
Updated with an attempt at warning plus critical levels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
jsonnet/custom.libsonnet (1)
63-85: Add the runbook URL in the Jsonnet source as wellThe generated manifest now exposes
runbook_urlfor both warning/criticaletcdHighCommitDurations, but the Jsonnet definition still omits it. Any other consumers rendering fromcustom.libsonnetwill miss the runbook link, leading to divergence between bundles. Please add the samerunbook_urlentry to both alert blocks so downstream renders stay in sync.annotations: { description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.', summary: 'etcd cluster 99th percentile commit durations are too high.', + runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md' }, }, @@ annotations: { description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.', summary: 'etcd cluster 99th percentile commit durations are too high.', + runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighCommitDurations.md' },
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
jsonnet/custom.libsonnet(2 hunks)jsonnet/jsonnetfile.lock.json(1 hunks)manifests/0000_90_etcd-operator_03_prometheusrule.yaml(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jsonnet/jsonnetfile.lock.json
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
jsonnet/custom.libsonnetmanifests/0000_90_etcd-operator_03_prometheusrule.yaml
|
@hasbro17 / @dgoodwin / @simonpasquier I've created an upstream PR for this and all the other stuff we have accumulated over the years in: |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, hasbro17 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@dgoodwin: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/jira refresh The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity. |
|
@openshift-bot: This pull request references Jira Issue OCPBUGS-64729, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Build on #1495, justification in this comment: https://issues.redhat.com/browse/OCPBUGS-64729?focusedId=28407511&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-28407511