Warm-pooled sandboxes for the Kubernetes compute driver #1879

@rmalani-nv

Problem Statement

Creating a Kubernetes sandbox is a cold start: the gateway creates a Sandbox CR, the agent-sandbox controller schedules a Pod, the image is pulled (or read from cache), the supervisor boots, and only then is the sandbox Ready. Measured locally this is ~4s+ even with the image preloaded. For interactive agent workloads and high-churn "fresh sandbox per task" usage, that latency dominates time-to-first-action.

We want near-instant Kubernetes sandbox provisioning via a warm pool of pre-provisioned, ready Pods.

This issue scopes the design (moved here from PR #1813 per the issue-first RFC process). It supersedes the draft RFC and incorporates the security review on that PR.

Proposed Design

Adopt the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate, SandboxWarmPool, and SandboxClaim in extensions.agents.x-k8s.io/v1alpha1), which already ship in the v0.4.6 release that OpenShell pins for the core Sandbox CRD. The gateway pre-declares operator-owned warm pools; on CreateSandbox, when the requested shape matches a pool, the Kubernetes driver creates a SandboxClaim that binds a pre-warmed Pod in ~0.1s. Non-matching requests fall back to the existing cold path.

Identity re-anchor (security-critical)

Today validate_sandbox_owner_reference() (crates/openshell-server/src/auth/k8s_sa.rs) cross-checks the owning Sandbox CR label openshell.ai/sandbox-id against the Pod annotation openshell.io/sandbox-id. Warm Sandboxes are created generically by the pool controller and carry agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerReference instead, so identity must re-anchor to the gateway-created SandboxClaim. The IssueSandboxToken warm path must enforce a strict fail-closed chain, with any mismatch rejecting:

  • TokenReview audience/SA/pod-name/pod-UID match the live Pod.
  • Pod has exactly one controlling Sandbox ownerReference (matching UID).
  • That Sandbox has exactly one controlling SandboxClaim ownerReference (matching name + UID); agents.x-k8s.io/claim-uid agrees.
  • The live SandboxClaim exists, has the expected UID, is bound, and status.sandbox.name equals the owning Sandbox.
  • The gateway Store has a durable, gateway-created record for (namespace, claim name, claim UID) containing the expected sandbox ID, and it equals the Pod's openshell.io/sandbox-id annotation.
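
The chain above can be sketched as a single fail-closed function. This is a minimal illustration, not the k8s_sa.rs implementation: every type and name here (`TokenIdentity`, `PodView`, `SandboxView`, `ClaimView`, `ClaimStore`, `validate_warm_chain`) is hypothetical, the views stand in for objects fetched live from the API server, and the TokenReview audience check is assumed to have happened before this function is called.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct OwnerRef { pub kind: String, pub name: String, pub uid: String, pub controller: bool }

/// Identity asserted by TokenReview (audience assumed to be checked by the caller).
#[derive(Debug, Clone)]
pub struct TokenIdentity { pub service_account: String, pub pod_name: String, pub pod_uid: String }

#[derive(Debug, Clone)]
pub struct PodView {
    pub namespace: String,
    pub name: String,
    pub uid: String,
    pub service_account: String,
    pub owner_refs: Vec<OwnerRef>,
    pub sandbox_id: Option<String>, // openshell.io/sandbox-id annotation
}

#[derive(Debug, Clone)]
pub struct SandboxView {
    pub name: String,
    pub uid: String,
    pub owner_refs: Vec<OwnerRef>,
    pub claim_uid: Option<String>, // agents.x-k8s.io/claim-uid
}

#[derive(Debug, Clone)]
pub struct ClaimView {
    pub name: String,
    pub uid: String,
    pub bound: bool,
    pub status_sandbox_name: Option<String>, // status.sandbox.name
}

#[derive(Debug, PartialEq, Eq)]
pub enum ChainError { Token, PodOwner, SandboxOwner, ClaimUid, ClaimState, StoreRecord, SandboxId }

/// Stand-in for the gateway Store: (namespace, claim name, claim UID) -> sandbox ID.
pub type ClaimStore = HashMap<(String, String, String), String>;

/// Returns the controlling ownerReference of `kind` only if there is exactly one.
fn sole_controller<'a>(refs: &'a [OwnerRef], kind: &str) -> Option<&'a OwnerRef> {
    let mut matches = refs.iter().filter(|r| r.controller && r.kind == kind);
    let first = matches.next()?;
    if matches.next().is_some() { None } else { Some(first) }
}

/// Walks Pod -> Sandbox -> SandboxClaim -> Store; any mismatch rejects.
pub fn validate_warm_chain(
    token: &TokenIdentity,
    pod: &PodView,
    sandbox: &SandboxView,
    claim: &ClaimView,
    store: &ClaimStore,
) -> Result<String, ChainError> {
    // 1. The token identity must match the live Pod.
    if token.service_account != pod.service_account
        || token.pod_name != pod.name
        || token.pod_uid != pod.uid
    {
        return Err(ChainError::Token);
    }
    // 2. Exactly one controlling Sandbox ownerReference, with matching UID.
    let sb_ref = sole_controller(&pod.owner_refs, "Sandbox").ok_or(ChainError::PodOwner)?;
    if sb_ref.name != sandbox.name || sb_ref.uid != sandbox.uid {
        return Err(ChainError::PodOwner);
    }
    // 3. Exactly one controlling SandboxClaim ownerReference; claim-uid agrees.
    let claim_ref =
        sole_controller(&sandbox.owner_refs, "SandboxClaim").ok_or(ChainError::SandboxOwner)?;
    if claim_ref.name != claim.name || claim_ref.uid != claim.uid {
        return Err(ChainError::SandboxOwner);
    }
    if sandbox.claim_uid.as_deref() != Some(claim.uid.as_str()) {
        return Err(ChainError::ClaimUid);
    }
    // 4. The live claim is bound to this Sandbox.
    if !claim.bound || claim.status_sandbox_name.as_deref() != Some(sandbox.name.as_str()) {
        return Err(ChainError::ClaimState);
    }
    // 5. A durable gateway record exists and equals the Pod annotation.
    let key = (pod.namespace.clone(), claim.name.clone(), claim.uid.clone());
    let expected = store.get(&key).ok_or(ChainError::StoreRecord)?;
    match pod.sandbox_id.as_deref() {
        Some(id) if id == expected.as_str() => Ok(expected.clone()),
        _ => Err(ChainError::SandboxId),
    }
}
```

Returning the sandbox ID only from the Store record, never from Pod metadata alone, keeps the gateway-created mapping as the source of truth; the Pod annotation is used purely as a cross-check.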

The mapping must live in the shared gateway Store (HA-safe; any replica may handle bootstrap), and claim creation must be ordered before a Pod can bootstrap. Users must not be able to set reserved metadata (openshell.io/*, agents.x-k8s.io/*, identity/SPIFFE keys, SandboxClaim.spec.additionalPodMetadata, spec.env).
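
A request-side guard for the reserved metadata rule might look like the sketch below. It is an assumption-laden illustration: `check_user_metadata` and `Rejection` are hypothetical names, only the two key prefixes named above are covered, and the identity/SPIFFE keys are left out because this issue does not enumerate them.

```rust
use std::collections::BTreeMap;

/// Reserved key prefixes named in this issue. Identity/SPIFFE keys are also
/// reserved, but they are not enumerated here, so this sketch omits them.
const RESERVED_PREFIXES: [&str; 2] = ["openshell.io/", "agents.x-k8s.io/"];

#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    /// A user-supplied label or annotation uses a reserved prefix.
    ReservedKey(String),
    /// Per-claim env was supplied for a request routed to the warm path.
    EnvOnWarmPath,
}

fn is_reserved(key: &str) -> bool {
    RESERVED_PREFIXES.iter().any(|p| key.starts_with(*p))
}

/// Rejects user-supplied metadata before any SandboxClaim is built from it.
pub fn check_user_metadata(
    labels: &BTreeMap<String, String>,
    annotations: &BTreeMap<String, String>,
    env: &[(String, String)],
    warm_path: bool,
) -> Result<(), Rejection> {
    if let Some(key) = labels
        .keys()
        .chain(annotations.keys())
        .find(|k| is_reserved(k.as_str()))
    {
        return Err(Rejection::ReservedKey(key.clone()));
    }
    if warm_path && !env.is_empty() {
        return Err(Rejection::EnvOnWarmPath);
    }
    Ok(())
}
```

Running this check before the driver constructs the claim means the only openshell.io/sandbox-id value that can reach additionalPodMetadata is the one the gateway generated.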

Workspace isolation (single-use) — the key design constraint

OpenShell's /sandbox is a PVC seeded once by an init container and preserved for the sandbox lifetime. Live testing (see Agent Investigation) shows: each warm Sandbox gets its own PVC (no sharing), and the pool replenishes with fresh Sandboxes (a claimed Sandbox is not recycled). However, under the upstream default SandboxClaim.spec.lifecycle.shutdownPolicy: Retain, deleting the claim deletes the Sandbox/Pod but leaves the workspace PVC orphaned and intact (verified — a written marker survived). Invariants this design must enforce:

  • Claimed Sandbox/Pod/PVC are single-use: consumed once, then destroyed — never returned to the pool.
  • OpenShell must explicitly destroy the workspace PVC on teardown (set shutdownPolicy: Delete/DeleteForeground and/or owner-ref/finalizer-driven PVC deletion). The Retain default is unsafe and must not be used unless PVC reuse is provably impossible.
  • Each warm Sandbox uses per-Sandbox volumeClaimTemplates (no shared workspace volume).
  • A warm Pod is seeded pristine from the image before it is claimable; it must never run user workload code while pooled.
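
The single-use invariants lend themselves to a small post-teardown check that an e2e test could run. The sketch below is hypothetical (`ClaimObservation`, `Violation`, and `check_single_use` are illustrative names) and assumes the test records the Sandbox and PVC UIDs seen for each claim and can list the PVC UIDs still present in the namespace.

```rust
use std::collections::HashSet;

/// What a test observed for one claim: the bound Sandbox and its workspace PVC.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClaimObservation {
    pub sandbox_uid: String,
    pub pvc_uid: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    /// A torn-down Sandbox was handed out again.
    SandboxRecycled(String),
    /// A torn-down workspace PVC was bound to a new claim.
    PvcReused(String),
    /// A torn-down workspace PVC still exists in the cluster.
    PvcOrphaned(String),
}

/// Checks a fresh claim against every previously torn-down claim.
pub fn check_single_use(
    torn_down: &[ClaimObservation],
    live_pvc_uids: &HashSet<String>,
    next: &ClaimObservation,
) -> Result<(), Violation> {
    for old in torn_down {
        if old.sandbox_uid == next.sandbox_uid {
            return Err(Violation::SandboxRecycled(old.sandbox_uid.clone()));
        }
        if old.pvc_uid == next.pvc_uid {
            return Err(Violation::PvcReused(old.pvc_uid.clone()));
        }
        if live_pvc_uids.contains(&old.pvc_uid) {
            return Err(Violation::PvcOrphaned(old.pvc_uid.clone()));
        }
    }
    Ok(())
}
```

Under the Retain behavior described above, this check would report `PvcOrphaned` for the first claim's PVC, which is the failure the explicit PVC destruction is meant to eliminate.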

Single-use still preserves the latency win: a warm Pod is pre-scheduled, image-pulled, supervisor-booted, and workspace-seeded before claim (~4s → ~0.1s measured).

Scope guardrails

Initially, only operator-declared pools using trusted templates/images are warm-pooled; user-supplied images or arbitrary per-request templates fall back to the cold path until per-tenant pool isolation and cleanup guarantees are designed.
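
The warm/cold routing decision, including this guardrail, could be expressed roughly as follows. This is a sketch under assumptions: `SandboxShape`, `PoolRegistry`, and `ProvisionPath` are hypothetical, and the fields that make up a "shape" are illustrative only, since the exact matching criteria are still to be settled.

```rust
use std::collections::HashMap;

/// Hypothetical shape key: a request must match a declared pool exactly.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SandboxShape {
    pub image: String,
    pub runtime_class: Option<String>,
    pub cpu_millis: u32,
    pub memory_mib: u32,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProvisionPath {
    /// Create a SandboxClaim against the named operator-owned template.
    Warm { template: String },
    /// Create a Sandbox CR directly (the existing cold path).
    Cold,
}

/// Operator-declared pools, keyed by the shape they serve.
pub struct PoolRegistry {
    pools: HashMap<SandboxShape, String>,
}

impl PoolRegistry {
    pub fn new() -> Self {
        Self { pools: HashMap::new() }
    }

    /// Registers an operator-declared pool for a trusted template.
    pub fn declare(&mut self, shape: SandboxShape, template: &str) {
        self.pools.insert(shape, template.to_string());
    }

    /// User-supplied images or templates always take the cold path; other
    /// requests go warm only on an exact shape match.
    pub fn route(&self, requested: &SandboxShape, user_supplied: bool) -> ProvisionPath {
        if user_supplied {
            return ProvisionPath::Cold;
        }
        match self.pools.get(requested) {
            Some(template) => ProvisionPath::Warm { template: template.clone() },
            None => ProvisionPath::Cold,
        }
    }
}
```

Defaulting to `Cold` on any miss keeps the warm path strictly opt-in: a request can only reach a pool that an operator declared for exactly that shape.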

What bakes vs. late-binds

  • Baked into the shared SandboxTemplate: image, mTLS mount, projected SA-token volume, supervisor sideload, capabilities, host aliases, runtimeClass, resources, workspace VCT.
  • Injected per-claim (annotation only): openshell.io/sandbox-id (per-claim env is rejected on the warm path).
  • Late-bound over the supervisor relay (already works): policy, providers. Identity is established by the existing token exchange, not Pod env.

Phased plan

  1. Settle this design (this issue).
  2. Driver warm path (flagged): create SandboxClaim instead of Sandbox for pooled shapes; gateway RBAC for extensions.agents.x-k8s.io; durable Store claim mapping. Install extensions.yaml in dev/e2e alongside this consumer (not before).
  3. Auth re-anchor: implement the fail-closed chain in k8s_sa.rs + adversarial tests.
  4. Single-use lifecycle: explicit claim/PVC destruction; workspace-isolation e2e (claim → write secret → delete → re-claim → assert clean + PVC not reused).
  5. Pool management + surface/docs.

Alternatives Considered

  • Patch identity onto the claimed Pod after bind (keep the label cross-check): requires granting the gateway patch pods (deliberately denied for immutability) and is racy.
  • Bare-Pod warm pools (if upstream pools created Pods, not Sandbox CRs; see upstream issue #390): would break the ownerReference auth chain. v0.4.6 creates Sandbox CRs.
  • Do nothing: accept cold-start latency. Viable for low-churn usage, poor for interactive agents.

Agent Investigation

Validated on a local k3s (k3d) cluster with agent-sandbox v0.4.6 (core + extensions):

  • Identity: claim binds in ~0.13s; the warm Pod is owned by a controlling Sandbox CR (ownerRef chain intact); the claim-injected openshell.io/sandbox-id annotation lands on the Pod; per-claim env is rejected on the warm path. The bound Sandbox carries agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerRef. The current validate_sandbox_owner_reference() fails closed against warm Sandboxes (they lack openshell.ai/sandbox-id), so there is no exploit in the install-only PR.
  • Workspace: SandboxTemplate.volumeClaimTemplates → each warm Sandbox gets its own Bound PVC (2 warm pods → 2 distinct PVCs, each owned by its Sandbox). Claiming replenished the pool with a new Sandbox + new PVC (claimed one not recycled). Deleting the claim deleted the Sandbox but left the workspace PVC Bound holding a written TENANT-A-SECRET marker — shutdownPolicy default is Retain, confirming the orphaned-workspace data risk.
  • Baseline: the cold path is unaffected: sandbox create → Ready, IssueSandboxToken → minted JWT, echo executed inside the sandbox over the supervisor relay.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
