Tune API probes #1074
|
/hold on TODOs until this fix gets scheduled to be completed in an upcoming sprint |
|
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/c80f40c7b9974374baf045dd909b57e9 ✔️ openstack-meta-content-provider SUCCESS in 1h 05m 59s |
|
/unhold |
|
/hold I'm doing comparison testing with and without this patch in the same env, and I see things that might need further refinement here. |
I will push a new PS with different numbers. While the current ones effectively prevent overload and pod kills, the readiness probe is too sensitive: it moves the pod out of the LB far too early and hence causes unnecessary reactions. We can be slower to react and still prevent the kills. I will have comparable numbers when I push the new version. |
|
Measurements
|
|
/unhold |
This will help to easily see how long certain API requests take in a highly loaded system.
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
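As a rough illustration of how such timing data could be used, the sketch below picks out slow requests from access log lines. It assumes the elapsed time is the last whitespace-separated field and is given in microseconds, as Apache's `%D` format directive would produce. The actual log format used by the patch is not shown in this conversation, so the field position, the unit, and the helper name `slow_requests` are all assumptions.

```python
from typing import Iterable, List, Tuple


def slow_requests(lines: Iterable[str],
                  threshold_seconds: float) -> List[Tuple[str, float]]:
    """Return (rest_of_line, elapsed_seconds) for entries above the threshold.

    Assumption: the last field of each line is the elapsed time in
    microseconds (what Apache's %D would log). Lines that do not end in
    an integer are skipped.
    """
    slow = []
    for line in lines:
        fields = line.rsplit(None, 1)  # split off the last field only
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        elapsed = int(fields[1]) / 1_000_000  # microseconds to seconds
        if elapsed > threshold_seconds:
            slow.append((fields[0], elapsed))
    return slow
```

Calling `slow_requests(open(path), 30.0)` on a log file would then list the requests that took longer than 30 seconds, where `path` is whatever access log location applies in the deployment.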
While investigating API scaling issues we noticed that our hard-coded probe configuration is not optimal for scaling nova-api. Instead of immediately killing pods when they do not respond within 30 seconds, we should first remove pods from the load balancer when they are getting overloaded, let them work through their backlog, and only kill a pod if it is hanging for an excessive amount of time.
Another observation was that we allow configuring the APITimeout parameter on our routes, but changing that value is not reflected in our probe configs. So even if the customer decides that it is OK for nova-api to respond slower by increasing the APITimeout, our probes do not become more forgiving.
This patch changes the probe configuration of nova-api and nova-metadata to:
* be quick to remove the pod from the load balancer if it is overloaded, via the readiness probe config
* be very forgiving about slow responses and only kill the pod if it is hanging for a long time, via the liveness probe
* scale both the readiness and liveness probe timeouts with the APITimeout configuration
Jira: OSPRH-25717
Jira: OSPRH-27192
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
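The idea in the commit message above can be sketched as follows: a strict readiness probe and a lenient liveness probe, both derived from the API timeout. This is a minimal illustration, not the operator's implementation. The field names mirror the Kubernetes probe fields `timeoutSeconds`, `periodSeconds` and `failureThreshold`, but every ratio and count below is a hypothetical placeholder and not a value taken from the patch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProbeSettings:
    """The Kubernetes probe fields that matter for this discussion."""
    timeout_seconds: int    # maps to timeoutSeconds
    period_seconds: int     # maps to periodSeconds
    failure_threshold: int  # maps to failureThreshold

    def seconds_of_failure_tolerated(self) -> int:
        """Rough time a pod can keep failing before Kubernetes acts."""
        return self.period_seconds * self.failure_threshold


def derive_probes(api_timeout: int) -> Tuple[ProbeSettings, ProbeSettings]:
    """Return (readiness, liveness) settings scaled from api_timeout seconds.

    All ratios and counts below are hypothetical placeholders, not the
    values chosen in the patch.
    """
    if api_timeout <= 0:
        raise ValueError("api_timeout must be positive")
    # Readiness: short timeout and few allowed failures, so an overloaded
    # pod leaves the load balancer quickly.
    readiness = ProbeSettings(
        timeout_seconds=max(1, api_timeout // 4),
        period_seconds=10,
        failure_threshold=2,
    )
    # Liveness: full API timeout and many allowed failures, so only a pod
    # that hangs for a long time is killed.
    liveness = ProbeSettings(
        timeout_seconds=api_timeout,
        period_seconds=10,
        failure_threshold=6,
    )
    return readiness, liveness
```

With this shape, raising the API timeout automatically makes both probes more forgiving, while the readiness probe always reacts well before the liveness probe does.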
4f76ec2 to
80e6f0c
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gibizer, mrkisaolamb |
The current revision only provides a 1.67x factor for liveness compared to readiness. The testing results look solid enough though, so LGTM. |
ed0bef5
into
openstack-k8s-operators:main
This is all a matter of balance. Creating a bigger gap between readiness and liveness can be done in two ways:
|
|
/cherry-pick 18.0-fr5 |
|
@gibizer: new pull request created: #1086 |
While investigating API scaling issues we noticed that our hard-coded
probe configuration is not optimal for scaling nova-api. Instead of
immediately killing pods when they do not respond within 30 seconds, we
should first remove pods from the load balancer when they are getting
overloaded, let them work through their backlog, and only kill a pod if
it is hanging for an excessive amount of time.
Another observation was that we allow configuring the APITimeout
parameter on our routes, but changing that value is not reflected in our
probe configs. So even if the customer decides that it is OK for
nova-api to respond slower by increasing the APITimeout, our probes do
not become more forgiving.
This patch changes the probe configuration of nova-api and nova-metadata
to:
* be quick to remove the pod from the load balancer if it is overloaded,
  via the readiness probe config
* be very forgiving about slow responses and only kill the pod if it
  is hanging for a long time, via the liveness probe
* scale both the readiness and liveness probe timeouts with the
  APITimeout configuration
The apache log config is also modified to log the elapsed time of
serving a request.
Jira: OSPRH-25717
Jira: OSPRH-27192