
Tune API probes #1074

Merged

openshift-merge-bot[bot] merged 2 commits into openstack-k8s-operators:main from gibizer:tune-probes on Mar 12, 2026

Conversation

@gibizer
Contributor

@gibizer gibizer commented Feb 18, 2026

While investigating API scaling issues we noticed that our hard-coded
probe configuration is not optimal for scaling nova-api. Instead of
immediately killing pods when they do not respond within 30 seconds, we
should first remove overloaded pods from the load balancer, let them
work through their backlog, and only kill a pod if it has been hanging
for an excessive amount of time.

Another observation was that we allow configuring the APITimeout
parameter on our routes, but changing that value is not reflected in our
probe configs. So even if the customer decides that it is OK for
nova-api to respond more slowly by increasing the APITimeout, our probes
do not become more forgiving.

This patch changes the probe configuration of nova-api and nova-metadata
to:

  • be quick to remove the pod from the load balancer if it is overloaded,
    via the readiness probe config
  • be very forgiving about slow responses and only kill the pod if it
    is hanging for a long time, via the liveness probe
  • scale both the readiness and liveness probe timeouts with the
    APITimeout configuration
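As a rough sketch of the idea, probe timings can be derived from the route's APITimeout so that increasing the timeout automatically makes both probes more forgiving. The multipliers and periods below are illustrative assumptions, not the values the operator actually uses:

```go
package main

import "fmt"

// probeConfig mirrors the Kubernetes probe fields that matter here.
type probeConfig struct {
	TimeoutSeconds   int
	PeriodSeconds    int
	FailureThreshold int
}

// probesFor derives probe timings from the configured APITimeout.
// The periods and thresholds are hypothetical placeholders.
func probesFor(apiTimeoutSeconds int) (readiness, liveness probeConfig) {
	// Readiness reacts quickly: once the pod is slower than the accepted
	// APITimeout it is pulled out of the load balancer.
	readiness = probeConfig{
		TimeoutSeconds:   apiTimeoutSeconds,
		PeriodSeconds:    5,
		FailureThreshold: 1,
	}
	// Liveness is forgiving: the pod is only restarted after it has been
	// unresponsive for many probe periods.
	liveness = probeConfig{
		TimeoutSeconds:   apiTimeoutSeconds,
		PeriodSeconds:    30,
		FailureThreshold: 10,
	}
	return readiness, liveness
}

func main() {
	r, l := probesFor(60) // the default 60s apiTimeout
	fmt.Printf("readiness: %+v\nliveness: %+v\n", r, l)
}
```

The key property is that both probe timeouts track the APITimeout, while the readiness window stays much shorter than the liveness window.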

The Apache log config is also modified to log the elapsed time serving each request.
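For illustration, per-request elapsed time can be added to an Apache access-log format with the `%D` directive (time taken to serve the request, in microseconds). The exact LogFormat string and log path below are assumptions, not necessarily what the patch uses:

```apache
# %D appends the time taken to serve the request, in microseconds
LogFormat "%h %l %u %t \"%r\" %>s %b %D" combined_with_time
CustomLog /var/log/httpd/nova_api_access.log combined_with_time
```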

Jira: OSPRH-25717
Jira: OSPRH-27192

@gibizer
Contributor Author

gibizer commented Feb 18, 2026

/hold on TODOs until this fix gets scheduled to be completed in an upcoming sprint

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c80f40c7b9974374baf045dd909b57e9

✔️ openstack-meta-content-provider SUCCESS in 1h 05m 59s
✔️ nova-operator-kuttl SUCCESS in 47m 29s
❌ nova-operator-tempest-multinode FAILURE in 31m 34s
❌ nova-operator-tempest-multinode-ceph FAILURE in 33m 53s

Contributor

@bogdando bogdando left a comment


I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness

@gibizer
Contributor Author

gibizer commented Mar 6, 2026

/unhold

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/hold I'm doing comparison testing with and without this patch in the same env and I see things that might need further refinement here.

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/hold I'm doing comparison testing with and without this patch in the same env and I see things that might need further refinement here.

I will push a new PS with different numbers. While the current ones effectively prevent overload and pod kills, the readiness probe is too sensitive and moves the pod out of the LB way too early, causing unnecessary churn. We can be slower to react and still prevent the kills. I will have comparable numbers when I push the new version.

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

Measurements

  • CRC, 20 CPU, 32G RAM
  • 10 EDPM compute node VMs
  • 80 OpenStack VMs spread across the compute nodes
  • 3 replicas of nova-api, nova-conductor, galera, rabbit, neutron; the rest are the install_yamls defaults
  • Querying the OpenStack server list via the API request GET https://nova-public-openstack.apps-crc.testing/v2.1/servers/detail
  • Load was generated with wrk (https://github.com/wg/wrk), e.g. $ wrk -c 25 -d 10m -t 10 -H "X-Auth-Token: <token>" --latency --timeout 70 https://nova-public-openstack.apps-crc.testing/v2.1/servers/detail where -c 25 means 25 parallel connections across -t 10 10 threads for -d 10m 10 minutes. The 10 minute duration and the --timeout 70 were kept for all runs. The 70 second timeout was selected because nova-api had the default 60 second apiTimeout configured in the control plane.
| load | result with main | result with this PR |
| --- | --- | --- |
| c25, t10 | 2.3 req/sec, no kill | 2.36 req/sec, no kill |
| c50, t20 | 2.2 req/sec, no kill | 2.21 req/sec, no kill |
| c60, t15 | 2.1 req/sec, one api kill | 2.22 req/sec, no kill, 5 req error |
| c75, t20 | many api kill | 2.13 req/sec, no kill, 30 req error |
| c100, t20 | many api + galera kill | 2.33 req/sec, galera kill, 360 req error |

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/unhold

gibizer added 2 commits March 12, 2026 09:04
This will help to easily see how long certain API requests take in a
highly loaded system.

Signed-off-by: Balazs Gibizer <gibi@redhat.com>
While investigating API scaling issues we noticed that our hard-coded
probe configuration is not optimal for scaling nova-api. Instead of
immediately killing pods when they do not respond within 30 seconds, we
should first remove overloaded pods from the load balancer, let them
work through their backlog, and only kill a pod if it has been hanging
for an excessive amount of time.

Another observation was that we allow configuring the APITimeout
parameter on our routes, but changing that value is not reflected in our
probe configs. So even if the customer decides that it is OK for
nova-api to respond more slowly by increasing the APITimeout, our probes
do not become more forgiving.

This patch changes the probe configuration of nova-api and nova-metadata
to:
* be quick to remove the pod from the load balancer if it is overloaded,
  via the readiness probe config
* be very forgiving about slow responses and only kill the pod if it
  is hanging for a long time, via the liveness probe
* scale both the readiness and liveness probe timeouts with the
  APITimeout configuration

Jira: OSPRH-25717
Jira: OSPRH-27192

Signed-off-by: Balazs Gibizer <gibi@redhat.com>
Contributor

@mrkisaolamb mrkisaolamb left a comment


lgtm

@openshift-ci
Contributor

openshift-ci bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gibizer, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [gibizer,mrkisaolamb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bogdando
Contributor

> I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness

The current edition only provides x1.67 for liveness compared to readiness. The testing results look solid enough though, so LGTM

@openshift-merge-bot openshift-merge-bot bot merged commit ed0bef5 into openstack-k8s-operators:main Mar 12, 2026
7 checks passed
@gibizer
Contributor Author

gibizer commented Mar 12, 2026

> I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness
>
> The current edition only provides x1.67 for liveness compared to readiness. The testing results look solid enough though, so LGTM

This is all a matter of balance. Creating a bigger gap between readiness and liveness can be done in two ways:

  • trigger the readiness failure earlier. This would cause throughput issues, as the pod would be moved out of the load balancer sooner and would be moved back and forth a lot. Such moves have overhead on k8s and limit the overall API throughput.
  • trigger the liveness failure later. The current liveness probe fails (with the default 60 second apiTimeout config) after 10x30 seconds = 5 minutes. That is 5 times longer than the accepted apiTimeout. Moving this further out means a hanging pod will be detected later. We can do it, but I would not like to do it by default. There is a parallel effort to make all the probe configuration overridable by the user via the CRD. So I suggest keeping the current proposal as the default; if needed, after the CRD change, specific customers can loosen the liveness probe further.

@gibizer
Contributor Author

gibizer commented Mar 13, 2026

/cherry-pick 18.0-fr5

@openshift-cherrypick-robot

@gibizer: new pull request created: #1086


In response to this:

/cherry-pick 18.0-fr5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
