fix: evict cached Temporal client when deletion cleanup fails by rupesh-parab-one-app · Pull Request #329 · temporalio/temporal-worker-controller

rupesh-parab-one-app · 2026-05-21T13:28:25Z

What

When a WorkerDeployment is deleted, handleDeletion fetches an SDK client from TemporalClientPool and walks the cleanup sequence (clear ramping → set current to unversioned → delete versions → delete deployment). If any of those Temporal-server-side calls fails with a transport-layer error (e.g. context.DeadlineExceeded against a poisoned client), the cached client stays in the pool and every requeue of the finalizer re-uses the same broken instance.

This PR adds a defer in handleDeletion, immediately after the SDK client is acquired, that evicts the cached client whenever the function returns a non-nil error. NotFound on Describe is intentionally treated as success and returns nil, so the defer skips eviction there — the healthy client stays in the pool.

Why

The main Reconcile path already evicts on serviceerror.PermissionDenied / Unauthenticated (via isAccessDeniedErr), but handleDeletion has no eviction at all. We hit this in production: the very first reconcile after install raced an ExternalSecrets-backed API-key Secret materialisation. The SDK Dial succeeded against Temporal Cloud (host:port reachable), but every subsequent gRPC call returned context.DeadlineExceeded. The deletion finalizer then re-ran ~240 times over ~40 minutes against the same poisoned client, and the wedge only cleared after we restarted the controller pod (which drops the in-memory pool).

Sanitised production timeline and code refs are in #328.

Changes

internal/controller/worker_controller.go: change handleDeletion to a named-return signature and add a defer that calls r.TemporalClientPool.EvictClient(clientPoolKey) when the return error is non-nil. The clientPoolKey is hoisted into its own local so both the cache lookup and the defer use the same key.
internal/controller/reconciler_events_test.go: add TestHandleDeletion_EvictsCachedClientOnTemporalFailure with two sub-tests:
- DescribeError_EvictsClient — pre-populates the pool with a stub client whose Describe returns context.DeadlineExceeded, calls handleDeletion, asserts the error propagates and the client is gone from the pool.
- DescribeNotFound_RetainsClient — uses the existing newStubTemporalClient helper (which returns serviceerror.NotFound on Describe), asserts handleDeletion returns nil and the healthy client is retained.
A small no-op Close() is added to the existing stubTemporalClient so that ClientPool.EvictClient (which calls info.client.Close()) doesn't NPE through the stub's embedded nil sdkclient.Client interface.

EvictClient is documented as safe-when-missing, so the defer is also a no-op on the dial-but-don't-cache fallback in case future callers stop pooling.

Test plan

go test ./internal/controller/... (controller, clientpool, k8s.io/utils) — all green
go vet ./internal/controller/... — clean
New subtests pass and demonstrate the regression is caught by failing on main (manually verified by reverting the defer locally — the eviction assert fails)
Maintainer-side make test-unit + make test-integration (the integration suite is test_dep-tagged and exercises the real deletion path via temporaltest.TestServer)

Alternative considered (broader)

Track consecutive Temporal-server failures per ClientPoolKey in the main Reconcile path and evict after N failures, or wrap every Describe / GetWorkerDeploymentState call in an eviction-on-transport-error helper. This would also handle the (so-far-not-observed) case of a poisoned client in steady-state reconciles, not just at deletion time.

Not included in this PR because:

It adds state and changes hot-path behaviour for every reconcile, not just the deletion finalizer.
The deletion-cleanup path is the one we observed wedged in production; the main reconcile has periodic requeues and the Connection.HostPort change path already self-recovers.
The minimal fix here closes the production regression without expanding the surface area.

Happy to expand scope to the broader direction if maintainers would prefer that instead — let me know and I'll layer it on.

Closes #328.

handleDeletion fetches an SDK client from TemporalClientPool to drive the WorkerDeployment finalizer (clear ramping version → set current to unversioned → delete versions → delete deployment). If any of those Temporal server-side calls fails, the cached client stays in the pool and the next reconcile reuses the same instance — which is fine for recoverable server errors but pathological if the client itself is broken at the transport layer. We hit this in production: the very first reconcile after install raced the API-key Secret materialisation. The SDK Dial succeeded against Temporal Cloud (host:port reachable) but every subsequent gRPC call returned `context.DeadlineExceeded`. That client was pooled, the WorkerDeployment was then deleted (via a Helm uninstall / chart re-render), and `handleDeletion.Describe` started returning the same deadline-exceeded error on every requeue. The finalizer loop ran ~240 retries over 40 minutes and only cleared after an operator restarted the controller pod, which dropped the in-memory pool. The main `Reconcile` path already evicts the cached client when it sees `serviceerror.PermissionDenied` / `Unauthenticated` (via `isAccessDeniedErr`), but `handleDeletion` has no eviction at all and transport-layer errors don't surface as access-denied. Add a defer in `handleDeletion` that evicts the pooled client on any non-nil return after the client is acquired. NotFound on Describe is treated as success (the deployment is already gone), so the defer skips eviction there — the healthy client stays in the pool. EvictClient is documented as safe-when-missing, so this is also a no-op if the dial path above succeeded but did not pool a client we recognise (defensive: future callers, e.g. fresh dial without caching, won't break this). Tests: new TestHandleDeletion_EvictsCachedClientOnTemporalFailure covers both branches — Describe error evicts, Describe NotFound retains. A no-op `Close()` is added to the existing `stubTemporalClient` so the real `EvictClient` can close the stub without panicking through its embedded nil `sdkclient.Client` interface.

CLAassistant · 2026-05-21T13:49:48Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

rupesh-parab-one-app requested review from a team and jlegrone as code owners May 21, 2026 13:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: evict cached Temporal client when deletion cleanup fails#329

fix: evict cached Temporal client when deletion cleanup fails#329
rupesh-parab-one-app wants to merge 1 commit into
temporalio:mainfrom
rupesh-parab-one-app:fix/evict-client-on-deletion-cleanup-failure

rupesh-parab-one-app commented May 21, 2026

Uh oh!

CLAassistant commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rupesh-parab-one-app commented May 21, 2026

What

Why

Changes

Test plan

Alternative considered (broader)

Uh oh!

CLAassistant commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants