Skip to content

Scope Deployment cache to worker deployments#331

Open
Shivs11 wants to merge 3 commits into
mainfrom
shivam/mem-relaxer
Open

Scope Deployment cache to worker deployments#331
Shivs11 wants to merge 3 commits into
mainfrom
shivam/mem-relaxer

Conversation

@Shivs11
Copy link
Copy Markdown
Member

@Shivs11 Shivs11 commented May 22, 2026

Summary

This scopes the controller-runtime cache for apps/v1.Deployment objects to only Deployments labeled with temporal.io/deployment-name.

Why

We investigated a report where one controller manager pod was using several GiB of memory while CPU was almost idle. The cluster only had one TemporalWorkerDeployment, but it had thousands of normal Kubernetes Deployment objects for unrelated services.

The confusing part is how Owns(&appsv1.Deployment{}) behaves in controller-runtime.

A useful way to think about this is that there are two separate layers:

  1. The manager starts a Kubernetes Deployment watcher/cache. This layer asks the API server for Deployment objects, watches for Deployment changes, and keeps the watched objects in memory.
  2. The controller event handler then decides which of those Deployment events should enqueue a reconcile. Owns(&appsv1.Deployment{}) belongs to this second layer: it makes sure only Deployments owned by a TemporalWorkerDeployment trigger reconciles for that owner.

Before this PR, layer 2 was already narrow, but layer 1 was still broad. So the controller was only reconciling its own Deployments, but the underlying Deployment watcher/cache could still list, watch, and retain unrelated Deployments from the cluster.

That means memory could grow with all Kubernetes Deployments in the cluster, not just Temporal-managed worker Deployments.

This PR narrows the first layer too. The manager cache now only admits Deployments that have the existing temporal.io/deployment-name label, which is the label we already put on worker Deployments generated by this controller.

Changes

  • Export the existing temporal.io/deployment-name label constant as k8s.WorkerDeploymentNameLabel.
  • Configure cache.Options.ByObject for appsv1.Deployment with an existence selector on temporal.io/deployment-name.
  • Add unit coverage that the manager cache selector only admits Deployments with the worker label.
  • Add unit coverage that generated worker Deployments carry the label the cache selector depends on.
  • Update tests to use the exported label constant.

Notes

This does not introduce a new Kubernetes label. Generated worker Deployments already had temporal.io/deployment-name; this change reuses that label to scope the Deployment watcher/cache.

Residual risk: a pre-existing or manually modified controller-owned Deployment missing temporal.io/deployment-name will no longer be visible to the cached Deployment reader/watch. Normal upgrade risk should be low because the controller-generated Deployments already used this label before the PR.

Testing

  • go test ./cmd
  • go test ./internal/k8s
  • go test ./...
  • git diff --check

@Shivs11 Shivs11 requested review from a team and jlegrone as code owners May 22, 2026 14:34
Comment thread cmd/main_test.go
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just made a new file here only because the main changes were also in the "main" file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant