Skip to content

fix: evict cached Temporal client when deletion cleanup fails#329

Open
rupesh-parab-one-app wants to merge 1 commit into
temporalio:mainfrom
rupesh-parab-one-app:fix/evict-client-on-deletion-cleanup-failure
Open

fix: evict cached Temporal client when deletion cleanup fails#329
rupesh-parab-one-app wants to merge 1 commit into
temporalio:mainfrom
rupesh-parab-one-app:fix/evict-client-on-deletion-cleanup-failure

Conversation

@rupesh-parab-one-app
Copy link
Copy Markdown

What

When a WorkerDeployment is deleted, handleDeletion fetches an SDK client from TemporalClientPool and walks the cleanup sequence (clear ramping → set current to unversioned → delete versions → delete deployment). If any of those Temporal-server-side calls fails with a transport-layer error (e.g. context.DeadlineExceeded against a poisoned client), the cached client stays in the pool and every requeue of the finalizer re-uses the same broken instance.

This PR adds a defer in handleDeletion, immediately after the SDK client is acquired, that evicts the cached client whenever the function returns a non-nil error. NotFound on Describe is intentionally treated as success and returns nil, so the defer skips eviction there — the healthy client stays in the pool.

Why

The main Reconcile path already evicts on serviceerror.PermissionDenied / Unauthenticated (via isAccessDeniedErr), but handleDeletion has no eviction at all. We hit this in production: the very first reconcile after install raced an ExternalSecrets-backed API-key Secret materialisation. The SDK Dial succeeded against Temporal Cloud (host:port reachable), but every subsequent gRPC call returned context.DeadlineExceeded. The deletion finalizer then re-ran ~240 times over ~40 minutes against the same poisoned client, and the wedge only cleared after we restarted the controller pod (which drops the in-memory pool).

Sanitised production timeline and code refs are in #328.

Changes

  • internal/controller/worker_controller.go: change handleDeletion to a named-return signature and add a defer that calls r.TemporalClientPool.EvictClient(clientPoolKey) when the return error is non-nil. The clientPoolKey is hoisted into its own local so both the cache lookup and the defer use the same key.
  • internal/controller/reconciler_events_test.go: add TestHandleDeletion_EvictsCachedClientOnTemporalFailure with two sub-tests:
    • DescribeError_EvictsClient — pre-populates the pool with a stub client whose Describe returns context.DeadlineExceeded, calls handleDeletion, asserts the error propagates and the client is gone from the pool.
    • DescribeNotFound_RetainsClient — uses the existing newStubTemporalClient helper (which returns serviceerror.NotFound on Describe), asserts handleDeletion returns nil and the healthy client is retained.
  • A small no-op Close() is added to the existing stubTemporalClient so that ClientPool.EvictClient (which calls info.client.Close()) doesn't NPE through the stub's embedded nil sdkclient.Client interface.

EvictClient is documented as safe-when-missing, so the defer is also a no-op on the dial-but-don't-cache fallback in case future callers stop pooling.

Test plan

  • go test ./internal/controller/... (controller, clientpool, k8s.io/utils) — all green
  • go vet ./internal/controller/... — clean
  • New subtests pass and demonstrate the regression is caught by failing on main (manually verified by reverting the defer locally — the eviction assert fails)
  • Maintainer-side make test-unit + make test-integration (the integration suite is test_dep-tagged and exercises the real deletion path via temporaltest.TestServer)

Alternative considered (broader)

Track consecutive Temporal-server failures per ClientPoolKey in the main Reconcile path and evict after N failures, or wrap every Describe / GetWorkerDeploymentState call in an eviction-on-transport-error helper. This would also handle the (so-far-not-observed) case of a poisoned client in steady-state reconciles, not just at deletion time.

Not included in this PR because:

  • It adds state and changes hot-path behaviour for every reconcile, not just the deletion finalizer.
  • The deletion-cleanup path is the one we observed wedged in production; the main reconcile has periodic requeues and the Connection.HostPort change path already self-recovers.
  • The minimal fix here closes the production regression without expanding the surface area.

Happy to expand scope to the broader direction if maintainers would prefer that instead — let me know and I'll layer it on.

Closes #328.

handleDeletion fetches an SDK client from TemporalClientPool to drive the
WorkerDeployment finalizer (clear ramping version → set current to
unversioned → delete versions → delete deployment). If any of those
Temporal server-side calls fails, the cached client stays in the pool
and the next reconcile reuses the same instance — which is fine for
recoverable server errors but pathological if the client itself is
broken at the transport layer.

We hit this in production: the very first reconcile after install
raced the API-key Secret materialisation. The SDK Dial succeeded
against Temporal Cloud (host:port reachable) but every subsequent
gRPC call returned `context.DeadlineExceeded`. That client was pooled,
the WorkerDeployment was then deleted (via a Helm uninstall / chart
re-render), and `handleDeletion.Describe` started returning the same
deadline-exceeded error on every requeue. The finalizer loop ran
~240 retries over 40 minutes and only cleared after an operator
restarted the controller pod, which dropped the in-memory pool.

The main `Reconcile` path already evicts the cached client when it
sees `serviceerror.PermissionDenied` / `Unauthenticated` (via
`isAccessDeniedErr`), but `handleDeletion` has no eviction at all and
transport-layer errors don't surface as access-denied. Add a defer in
`handleDeletion` that evicts the pooled client on any non-nil return
after the client is acquired. NotFound on Describe is treated as
success (the deployment is already gone), so the defer skips eviction
there — the healthy client stays in the pool.

EvictClient is documented as safe-when-missing, so this is also a
no-op if the dial path above succeeded but did not pool a client we
recognise (defensive: future callers, e.g. fresh dial without
caching, won't break this).

Tests: new TestHandleDeletion_EvictsCachedClientOnTemporalFailure
covers both branches — Describe error evicts, Describe NotFound retains.
A no-op `Close()` is added to the existing `stubTemporalClient` so the
real `EvictClient` can close the stub without panicking through its
embedded nil `sdkclient.Client` interface.
@rupesh-parab-one-app rupesh-parab-one-app requested review from a team and jlegrone as code owners May 21, 2026 13:28
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WorkerDeployment deletion finalizer infinite-loops on transport-level Temporal SDK errors (cached client never evicted)

3 participants