fix: evict cached Temporal client when deletion cleanup fails#329
Open
rupesh-parab-one-app wants to merge 1 commit into
Open
Conversation
handleDeletion fetches an SDK client from TemporalClientPool to drive the WorkerDeployment finalizer (clear ramping version → set current to unversioned → delete versions → delete deployment). If any of those Temporal server-side calls fails, the cached client stays in the pool and the next reconcile reuses the same instance — which is fine for recoverable server errors but pathological if the client itself is broken at the transport layer. We hit this in production: the very first reconcile after install raced the API-key Secret materialisation. The SDK Dial succeeded against Temporal Cloud (host:port reachable) but every subsequent gRPC call returned `context.DeadlineExceeded`. That client was pooled, the WorkerDeployment was then deleted (via a Helm uninstall / chart re-render), and `handleDeletion.Describe` started returning the same deadline-exceeded error on every requeue. The finalizer loop ran ~240 retries over 40 minutes and only cleared after an operator restarted the controller pod, which dropped the in-memory pool. The main `Reconcile` path already evicts the cached client when it sees `serviceerror.PermissionDenied` / `Unauthenticated` (via `isAccessDeniedErr`), but `handleDeletion` has no eviction at all and transport-layer errors don't surface as access-denied. Add a defer in `handleDeletion` that evicts the pooled client on any non-nil return after the client is acquired. NotFound on Describe is treated as success (the deployment is already gone), so the defer skips eviction there — the healthy client stays in the pool. EvictClient is documented as safe-when-missing, so this is also a no-op if the dial path above succeeded but did not pool a client we recognise (defensive: future callers, e.g. fresh dial without caching, won't break this). Tests: new TestHandleDeletion_EvictsCachedClientOnTemporalFailure covers both branches — Describe error evicts, Describe NotFound retains. A no-op `Close()` is added to the existing `stubTemporalClient` so the real `EvictClient` can close the stub without panicking through its embedded nil `sdkclient.Client` interface.
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
When a
WorkerDeploymentis deleted,handleDeletionfetches an SDK client fromTemporalClientPooland walks the cleanup sequence (clear ramping → set current to unversioned → delete versions → delete deployment). If any of those Temporal-server-side calls fails with a transport-layer error (e.g.context.DeadlineExceededagainst a poisoned client), the cached client stays in the pool and every requeue of the finalizer re-uses the same broken instance.This PR adds a
deferinhandleDeletion, immediately after the SDK client is acquired, that evicts the cached client whenever the function returns a non-nil error.NotFoundonDescribeis intentionally treated as success and returnsnil, so the defer skips eviction there — the healthy client stays in the pool.Why
The main
Reconcilepath already evicts onserviceerror.PermissionDenied/Unauthenticated(viaisAccessDeniedErr), buthandleDeletionhas no eviction at all. We hit this in production: the very first reconcile after install raced an ExternalSecrets-backed API-key Secret materialisation. The SDKDialsucceeded against Temporal Cloud (host:port reachable), but every subsequent gRPC call returnedcontext.DeadlineExceeded. The deletion finalizer then re-ran ~240 times over ~40 minutes against the same poisoned client, and the wedge only cleared after we restarted the controller pod (which drops the in-memory pool).Sanitised production timeline and code refs are in #328.
Changes
internal/controller/worker_controller.go: changehandleDeletionto a named-return signature and add adeferthat callsr.TemporalClientPool.EvictClient(clientPoolKey)when the return error is non-nil. TheclientPoolKeyis hoisted into its own local so both the cache lookup and the defer use the same key.internal/controller/reconciler_events_test.go: addTestHandleDeletion_EvictsCachedClientOnTemporalFailurewith two sub-tests:DescribeError_EvictsClient— pre-populates the pool with a stub client whoseDescribereturnscontext.DeadlineExceeded, callshandleDeletion, asserts the error propagates and the client is gone from the pool.DescribeNotFound_RetainsClient— uses the existingnewStubTemporalClienthelper (which returnsserviceerror.NotFoundonDescribe), assertshandleDeletionreturnsniland the healthy client is retained.Close()is added to the existingstubTemporalClientso thatClientPool.EvictClient(which callsinfo.client.Close()) doesn't NPE through the stub's embedded nilsdkclient.Clientinterface.EvictClientis documented as safe-when-missing, so the defer is also a no-op on the dial-but-don't-cache fallback in case future callers stop pooling.Test plan
go test ./internal/controller/...(controller, clientpool, k8s.io/utils) — all greengo vet ./internal/controller/...— cleanmain(manually verified by reverting the defer locally — the eviction assert fails)make test-unit+make test-integration(the integration suite istest_dep-tagged and exercises the real deletion path viatemporaltest.TestServer)Alternative considered (broader)
Track consecutive Temporal-server failures per
ClientPoolKeyin the mainReconcilepath and evict after N failures, or wrap everyDescribe/GetWorkerDeploymentStatecall in an eviction-on-transport-error helper. This would also handle the (so-far-not-observed) case of a poisoned client in steady-state reconciles, not just at deletion time.Not included in this PR because:
Connection.HostPortchange path already self-recovers.Happy to expand scope to the broader direction if maintainers would prefer that instead — let me know and I'll layer it on.
Closes #328.