
Tune API probes #1074

Merged

openshift-merge-bot[bot] merged 2 commits into openstack-k8s-operators:main from gibizer:tune-probes on Mar 12, 2026

Conversation

@gibizer
Contributor

@gibizer gibizer commented Feb 18, 2026

While investigating API scaling issues we noticed that our hard-coded
probe configuration is not optimal for scaling nova-api. Instead of
immediately killing pods when they do not respond within 30 seconds, we
should first remove overloaded pods from the load balancer, let them
work through their backlog, and only kill a pod if it has been hanging
for an excessive amount of time.

Another observation was that we allow configuring the APITimeout
parameter on our routes, but changing that value is not reflected in our
probe configs. So even if the customer decides that it is OK for
nova-api to respond more slowly by increasing the APITimeout, our probes
do not become more forgiving.

This patch changes the probe configuration of nova-api and nova-metadata
to:

  • be quick to remove the pod from the load balancer if it is overloaded,
    via the readiness probe config
  • be very forgiving about slow responses and only kill the pod if it
    is hanging for a long time, via the liveness probe
  • scale both the readiness and liveness probe timeouts with the
    APITimeout configuration
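As a rough sketch of the idea, probe timings can be derived from the route's APITimeout so that increasing the timeout automatically makes both probes more forgiving. The multipliers and periods below are illustrative assumptions, not the values the operator actually uses:

```go
package main

import "fmt"

// probeConfig mirrors the Kubernetes probe fields that matter here.
type probeConfig struct {
	TimeoutSeconds   int
	PeriodSeconds    int
	FailureThreshold int
}

// probesFor derives probe timings from the configured APITimeout.
// The periods and thresholds are hypothetical placeholders.
func probesFor(apiTimeoutSeconds int) (readiness, liveness probeConfig) {
	// Readiness reacts quickly: once the pod is slower than the accepted
	// APITimeout it is pulled out of the load balancer.
	readiness = probeConfig{
		TimeoutSeconds:   apiTimeoutSeconds,
		PeriodSeconds:    5,
		FailureThreshold: 1,
	}
	// Liveness is forgiving: the pod is only restarted after it has been
	// unresponsive for many probe periods.
	liveness = probeConfig{
		TimeoutSeconds:   apiTimeoutSeconds,
		PeriodSeconds:    30,
		FailureThreshold: 10,
	}
	return readiness, liveness
}

func main() {
	r, l := probesFor(60) // the default 60s apiTimeout
	fmt.Printf("readiness: %+v\nliveness: %+v\n", r, l)
}
```

The key property is that both probe timeouts track the APITimeout, while the readiness window stays much shorter than the liveness window.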

The Apache log config is also modified to log the elapsed time serving each request.
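For illustration, per-request elapsed time can be added to an Apache access-log format with the `%D` directive (time taken to serve the request, in microseconds). The exact LogFormat string and log path below are assumptions, not necessarily what the patch uses:

```apache
# %D appends the time taken to serve the request, in microseconds
LogFormat "%h %l %u %t \"%r\" %>s %b %D" combined_with_time
CustomLog /var/log/httpd/nova_api_access.log combined_with_time
```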

Jira: OSPRH-25717
Jira: OSPRH-27192

@gibizer
Contributor Author

gibizer commented Feb 18, 2026

/hold on TODOs until this fix gets scheduled to be completed in an upcoming sprint

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c80f40c7b9974374baf045dd909b57e9

✔️ openstack-meta-content-provider SUCCESS in 1h 05m 59s
✔️ nova-operator-kuttl SUCCESS in 47m 29s
❌ nova-operator-tempest-multinode FAILURE in 31m 34s
❌ nova-operator-tempest-multinode-ceph FAILURE in 33m 53s

Contributor

@bogdando bogdando left a comment


I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness

@gibizer
Contributor Author

gibizer commented Mar 6, 2026

/unhold

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/hold I'm doing comparison testing with and without this patch in the same env and I see things that might need further refinement here.

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/hold I'm doing comparison testing with and without this patch in the same env and I see things that might need further refinement here.

I will push a new PS with different numbers. While the current ones effectively prevent overload and pod kills, the readiness probe is too sensitive and moves the pod out of the LB way too early, causing unnecessary churn. We can be slower to react and still prevent the kills. I will have comparable numbers when I push the new version.

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

Measurements

  • CRC, 20 CPU, 32G RAM
  • 10 EDPM compute node VMs
  • 80 OpenStack VMs spread across the compute nodes
  • 3 replicas of nova-api, nova-conductor, galera, rabbit, neutron; the rest are the install_yamls defaults
  • Querying the OpenStack server list via the API request GET https://nova-public-openstack.apps-crc.testing/v2.1/servers/detail
  • Load was generated with wrk (https://github.com/wg/wrk), e.g. $ wrk -c 25 -d 10m -t 10 -H "X-Auth-Token: <token>" --latency --timeout 70 https://nova-public-openstack.apps-crc.testing/v2.1/servers/detail where -c 25 means 25 parallel connections across -t 10 10 threads for -d 10m 10 minutes. The 10 minute duration and the --timeout 70 were kept for all runs. The 70 second timeout was selected because nova-api had the default 60 second apiTimeout configured in the control plane.
| load | result with main | result with this PR |
| --- | --- | --- |
| c25, t10 | 2.3 req/sec, no kill | 2.36 req/sec, no kill |
| c50, t20 | 2.2 req/sec, no kill | 2.21 req/sec, no kill |
| c60, t15 | 2.1 req/sec, one api kill | 2.22 req/sec, no kill, 5 req error |
| c75, t20 | many api kill | 2.13 req/sec, no kill, 30 req error |
| c100, t20 | many api + galera kill | 2.33 req/sec, galera kill, 360 req error |

@gibizer
Contributor Author

gibizer commented Mar 11, 2026

/unhold

gibizer added 2 commits March 12, 2026 09:04
This will help to easily see how long certain API requests take in a
highly loaded system.

Signed-off-by: Balazs Gibizer <gibi@redhat.com>
While investigating API scaling issues we noticed that our hard-coded
probe configuration is not optimal for scaling nova-api. Instead of
immediately killing pods when they do not respond within 30 seconds, we
should first remove overloaded pods from the load balancer, let them
work through their backlog, and only kill a pod if it has been hanging
for an excessive amount of time.

Another observation was that we allow configuring the APITimeout
parameter on our routes, but changing that value is not reflected in our
probe configs. So even if the customer decides that it is OK for
nova-api to respond more slowly by increasing the APITimeout, our probes
do not become more forgiving.

This patch changes the probe configuration of nova-api and nova-metadata
to:
* be quick to remove the pod from the load balancer if it is overloaded,
  via the readiness probe config
* be very forgiving about slow responses and only kill the pod if it
  is hanging for a long time, via the liveness probe
* scale both the readiness and liveness probe timeouts with the
  APITimeout configuration

Jira: OSPRH-25717
Jira: OSPRH-27192

Signed-off-by: Balazs Gibizer <gibi@redhat.com>
Contributor

@mrkisaolamb mrkisaolamb left a comment


lgtm

@openshift-ci
Contributor

openshift-ci bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gibizer, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [gibizer,mrkisaolamb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bogdando
Contributor

> I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness

The current edition only provides x1.67 for liveness compared to readiness. The testing results look solid enough though, so LGTM

@openshift-merge-bot openshift-merge-bot bot merged commit ed0bef5 into openstack-k8s-operators:main Mar 12, 2026
7 checks passed
@gibizer
Contributor Author

gibizer commented Mar 12, 2026

> I support the approach, although I would use larger (x3-x10) values for liveness compared to readiness
>
> The current edition only provides x1.67 for liveness compared to readiness. The testing results look solid enough though, so LGTM

This is all a matter of balance. Creating a bigger gap between readiness and liveness can be done in two ways:

  • trigger the readiness failure earlier. This would cause throughput issues, as the pod would be moved out of the load balancer sooner and would be moved back and forth a lot. Such moves have overhead on k8s and limit the overall API throughput.
  • trigger the liveness failure later. The current liveness probe fails (with the default 60 second apiTimeout config) after 10x30 seconds = 5 minutes. That is 5 times longer than the accepted apiTimeout. Moving this further out means a hanging pod will be detected later. We can do it, but I would not like to do it by default. There is a parallel effort to make all the probe configuration overridable by the user via the CRD. So I suggest keeping the current proposal as the default; if needed, after the CRD change, specific customers can loosen the liveness probe further.

@gibizer
Contributor Author

gibizer commented Mar 13, 2026

/cherry-pick 18.0-fr5

@openshift-cherrypick-robot

@gibizer: new pull request created: #1086


In response to this:

/cherry-pick 18.0-fr5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
