Problem Statement
Creating a Kubernetes sandbox is a cold start: the gateway creates a Sandbox CR, the agent-sandbox controller schedules a Pod, the image is pulled (or read from cache), the supervisor boots, and only then is the sandbox Ready. Measured locally this is ~4s+ even with the image preloaded. For interactive agent workloads and high-churn "fresh sandbox per task" usage, that latency dominates time-to-first-action.
We want near-instant Kubernetes sandbox provisioning via a warm pool of pre-provisioned, ready Pods.
This issue scopes the design (moved here from PR #1813 per the issue-first RFC process). It supersedes the draft RFC and incorporates the security review on that PR.
Proposed Design
Adopt the upstream agent-sandbox warm-pool extension CRDs — SandboxTemplate, SandboxWarmPool, SandboxClaim (extensions.agents.x-k8s.io/v1alpha1) — already shipped in the v0.4.6 release OpenShell pins for the core Sandbox CRD. The gateway pre-declares operator-owned warm pools; on CreateSandbox, when the requested shape matches a pool, the Kubernetes driver creates a SandboxClaim that binds a pre-warmed Pod in ~0.1s. Non-matching requests fall back to the existing cold path.
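The match-or-fall-back decision described above can be sketched as follows. This is a minimal illustration, not the driver's actual code: `SandboxShape`, `WarmPool`, `ProvisionPath`, and `select_path` are hypothetical names, and the shape is reduced to two fields on the assumption that a real shape comparison would cover everything baked into the template.

```rust
/// Hypothetical description of a requested sandbox "shape". The real
/// CreateSandbox request carries more fields than shown here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SandboxShape {
    pub image: String,
    pub runtime_class: Option<String>,
}

/// Hypothetical operator-declared warm pool: a name plus the shape it serves.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WarmPool {
    pub name: String,
    pub shape: SandboxShape,
}

/// Which provisioning path the driver takes for a request.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProvisionPath {
    /// Create a SandboxClaim against the named pool.
    Warm { pool: String },
    /// Create a Sandbox CR directly (the existing cold path).
    Cold,
}

/// Returns the warm path only on an exact shape match; anything else is cold.
pub fn select_path(requested: &SandboxShape, pools: &[WarmPool]) -> ProvisionPath {
    pools
        .iter()
        .find(|pool| pool.shape == *requested)
        .map(|pool| ProvisionPath::Warm { pool: pool.name.clone() })
        .unwrap_or(ProvisionPath::Cold)
}
```

Requiring exact equality keeps the fallback conservative: a request that differs from every pool in any compared field takes the cold path rather than binding a Pod that only approximately fits.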
Identity re-anchor (security-critical)
Today validate_sandbox_owner_reference() (crates/openshell-server/src/auth/k8s_sa.rs) cross-checks the owning Sandbox CR label openshell.ai/sandbox-id against the Pod annotation openshell.io/sandbox-id. Warm Sandboxes are created generically by the pool controller and carry agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerReference instead, so identity must re-anchor to the gateway-created SandboxClaim. The IssueSandboxToken warm path must enforce a strict fail-closed chain, with any mismatch rejecting:
TokenReview audience/SA/pod-name/pod-UID match the live Pod.
Pod has exactly one controlling Sandbox ownerReference (matching UID).
That Sandbox has exactly one controlling SandboxClaim ownerReference (matching name + UID); agents.x-k8s.io/claim-uid agrees.
The live SandboxClaim exists, has the expected UID, is bound, and status.sandbox.name equals the owning Sandbox.
The gateway Store has a durable, gateway-created record for (namespace, claim name, claim UID) containing the expected sandbox ID, and it equals the Pod's openshell.io/sandbox-id annotation.
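The five checks above can be sketched as a single function that returns a sandbox ID only when every link agrees. This is an illustrative sketch, not the code in `k8s_sa.rs`: plain structs stand in for the live Kubernetes objects and the Store record, every type and field name is hypothetical, the namespace is assumed to be scoped by the caller, and the audience and service-account parts of the TokenReview check are omitted for brevity.

```rust
// Illustrative only: all type and field names here are hypothetical.

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OwnerRef { pub kind: String, pub name: String, pub uid: String, pub controller: bool }

#[derive(Debug, Clone)]
pub struct TokenIdentity { pub pod_name: String, pub pod_uid: String }

#[derive(Debug, Clone)]
pub struct PodView {
    pub name: String,
    pub uid: String,
    pub owner_refs: Vec<OwnerRef>,
    pub sandbox_id_annotation: Option<String>, // openshell.io/sandbox-id
}

#[derive(Debug, Clone)]
pub struct SandboxView {
    pub name: String,
    pub uid: String,
    pub owner_refs: Vec<OwnerRef>,
    pub claim_uid_metadata: Option<String>, // agents.x-k8s.io/claim-uid
}

#[derive(Debug, Clone)]
pub struct ClaimView {
    pub name: String,
    pub uid: String,
    pub bound: bool,
    pub bound_sandbox_name: Option<String>, // status.sandbox.name
}

#[derive(Debug, Clone)]
pub struct StoreRecord { pub claim_name: String, pub claim_uid: String, pub sandbox_id: String }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rejection { TokenPodMismatch, PodOwner, SandboxOwner, ClaimNotBound, NoStoreRecord, SandboxIdMismatch }

/// Returns the owner reference only if exactly one controlling reference of `kind` exists.
fn sole_controller<'a>(refs: &'a [OwnerRef], kind: &str) -> Option<&'a OwnerRef> {
    let mut matches = refs.iter().filter(|r| r.controller && r.kind == kind);
    match (matches.next(), matches.next()) {
        (Some(only), None) => Some(only),
        _ => None,
    }
}

/// Walks Pod -> Sandbox -> SandboxClaim -> Store record; any mismatch rejects.
pub fn verify_warm_chain(
    token: &TokenIdentity,
    pod: &PodView,
    sandbox: &SandboxView,
    claim: &ClaimView,
    record: Option<&StoreRecord>,
) -> Result<String, Rejection> {
    // 1. The token's bound Pod must be the live Pod.
    if token.pod_name != pod.name || token.pod_uid != pod.uid {
        return Err(Rejection::TokenPodMismatch);
    }
    // 2. Exactly one controlling Sandbox owner, with the live Sandbox's UID.
    let pod_owner = sole_controller(&pod.owner_refs, "Sandbox").ok_or(Rejection::PodOwner)?;
    if pod_owner.uid != sandbox.uid {
        return Err(Rejection::PodOwner);
    }
    // 3. Exactly one controlling SandboxClaim owner; claim-uid must agree.
    let sb_owner = sole_controller(&sandbox.owner_refs, "SandboxClaim").ok_or(Rejection::SandboxOwner)?;
    if sb_owner.name != claim.name
        || sb_owner.uid != claim.uid
        || sandbox.claim_uid_metadata.as_deref() != Some(claim.uid.as_str())
    {
        return Err(Rejection::SandboxOwner);
    }
    // 4. The live claim is bound to this Sandbox.
    if !claim.bound || claim.bound_sandbox_name.as_deref() != Some(sandbox.name.as_str()) {
        return Err(Rejection::ClaimNotBound);
    }
    // 5. A gateway-created Store record exists and matches the Pod annotation.
    let record = record.ok_or(Rejection::NoStoreRecord)?;
    if record.claim_name != claim.name || record.claim_uid != claim.uid {
        return Err(Rejection::NoStoreRecord);
    }
    if pod.sandbox_id_annotation.as_deref() != Some(record.sandbox_id.as_str()) {
        return Err(Rejection::SandboxIdMismatch);
    }
    Ok(record.sandbox_id.clone())
}
```

The function has a single success exit at the very end, so a missing object, a missing Store record, or a second controlling owner all end in a rejection rather than a default identity.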
The mapping must live in the shared gateway Store (HA-safe; any replica may handle bootstrap), and claim creation must be ordered before a Pod can bootstrap. Users must not be able to set reserved metadata (openshell.io/*, agents.x-k8s.io/*, identity/SPIFFE keys, SandboxClaim.spec.additionalPodMetadata, spec.env).
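One way to enforce the reserved-metadata rule is to reject the request at the gateway before any claim is created. The sketch below is a hedged illustration: `validate_warm_request` and `RESERVED_PREFIXES` are hypothetical names, only the two key prefixes named above are covered, and the identity/SPIFFE keys are left out because this issue does not enumerate them.

```rust
/// Key prefixes that users may not set. Only the two prefixes named in this
/// issue are listed; identity/SPIFFE keys would need to be added.
const RESERVED_PREFIXES: [&str; 2] = ["openshell.io/", "agents.x-k8s.io/"];

/// Hypothetical gateway-side check for a warm-path request.
/// `metadata_keys` are user-supplied label/annotation keys and `env_names`
/// are user-supplied environment variable names.
pub fn validate_warm_request(metadata_keys: &[&str], env_names: &[&str]) -> Result<(), String> {
    // Reject any user key under a reserved prefix.
    let reserved = metadata_keys
        .iter()
        .find(|key| RESERVED_PREFIXES.iter().any(|prefix| key.starts_with(*prefix)));
    if let Some(key) = reserved {
        return Err(format!("reserved metadata key: {}", key));
    }
    // Per-claim env is not accepted on the warm path at all.
    if let Some(name) = env_names.first() {
        return Err(format!("per-claim env is not allowed on the warm path: {}", name));
    }
    Ok(())
}
```

Checking by prefix rather than against an exact key list means a reserved key introduced later under the same prefix is rejected without a code change.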
Workspace isolation (single-use) — the key design constraint
OpenShell's /sandbox is a PVC seeded once by an init container and preserved for the sandbox lifetime. Live testing (see Agent Investigation) shows: each warm Sandbox gets its own PVC (no sharing), and the pool replenishes with fresh Sandboxes (a claimed Sandbox is not recycled). However, under the upstream default SandboxClaim.spec.lifecycle.shutdownPolicy: Retain, deleting the claim deletes the Sandbox/Pod but leaves the workspace PVC orphaned and intact (verified — a written marker survived). Invariants this design must enforce:
Claimed Sandbox/Pod/PVC are single-use: consumed once, then destroyed — never returned to the pool.
OpenShell must explicitly destroy the workspace PVC on teardown (set shutdownPolicy: Delete/DeleteForeground and/or owner-ref/finalizer-driven PVC deletion). The Retain default is unsafe and must not be used unless PVC reuse is provably impossible.
Each warm Sandbox uses per-Sandbox volumeClaimTemplates (no shared workspace volume).
A warm Pod is seeded pristine from the image before it is claimable; it must never run user workload code while pooled.
Single-use still preserves the latency win: a warm Pod is pre-scheduled, image-pulled, supervisor-booted, and workspace-seeded before claim (~4s → ~0.1s measured).
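The single-use invariants can be read as a small state machine with no edge leading back into the pool. The sketch below is illustrative only; `WarmState`, `Event`, and `next_state` are hypothetical names and do not correspond to upstream controller states.

```rust
/// Hypothetical lifecycle of one warm Sandbox/Pod/PVC triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WarmState {
    /// Workspace is being seeded from the image; not yet claimable.
    Seeding,
    /// Ready and waiting in the pool; no user workload has run.
    Pooled,
    /// Bound to exactly one claim.
    Claimed,
    /// Sandbox, Pod, and workspace PVC are all deleted.
    Destroyed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    SeedComplete,
    Claim,
    Teardown,
}

/// Returns the next state, or None when the transition is not allowed.
/// There is deliberately no transition from Claimed back to Pooled.
pub fn next_state(state: WarmState, event: Event) -> Option<WarmState> {
    match (state, event) {
        (WarmState::Seeding, Event::SeedComplete) => Some(WarmState::Pooled),
        (WarmState::Pooled, Event::Claim) => Some(WarmState::Claimed),
        (WarmState::Seeding | WarmState::Pooled | WarmState::Claimed, Event::Teardown) => {
            Some(WarmState::Destroyed)
        }
        _ => None,
    }
}

/// Only a fully seeded, never-claimed Sandbox may be handed to a claim.
pub fn is_claimable(state: WarmState) -> bool {
    state == WarmState::Pooled
}
```

In this model the only way out of `Claimed` is `Teardown`, which corresponds to the requirement that the workspace PVC is destroyed together with the Sandbox and Pod.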
Scope guardrails
Initially, only operator-declared pools using trusted templates/images are warm-pooled; user-supplied images or arbitrary per-request templates fall back to the cold path until per-tenant pool isolation and cleanup guarantees are designed.
What bakes vs. late-binds
Baked into the shared SandboxTemplate: image, mTLS mount, projected SA-token volume, supervisor sideload, capabilities, host aliases, runtimeClass, resources, workspace VCT.
Injected per-claim (annotation only): openshell.io/sandbox-id (per-claim env is rejected on the warm path).
Late-bound over the supervisor relay (already works): policy, providers. Identity is established by the existing token exchange, not Pod env.
Phased plan
Settle this design (this issue).
Driver warm path (flagged): create SandboxClaim instead of Sandbox for pooled shapes; gateway RBAC for extensions.agents.x-k8s.io; durable Store claim mapping. Install extensions.yaml in dev/e2e alongside this consumer (not before).
Auth re-anchor: implement the fail-closed chain in k8s_sa.rs + adversarial tests.
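Alternatives Considered
Patch identity onto the claimed Pod after bind (keep the label cross-check): requires granting the gateway patch pods (deliberately denied for immutability) and is racy.
[…] Sandbox CRs (see upstream issue #390): would break the ownerReference auth chain. v0.4.6 creates Sandbox CRs.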
Do nothing: accept cold-start latency. Viable for low-churn usage, poor for interactive agents.
Agent Investigation
Validated on a local k3s (k3d) cluster with agent-sandbox v0.4.6 (core + extensions):
Identity: claim binds in ~0.13s; the warm Pod is owned by a controlling Sandbox CR (ownerRef chain intact); the claim-injected openshell.io/sandbox-id annotation lands on the Pod; per-claim env is rejected on the warm path. The bound Sandbox carries agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerRef. The current validate_sandbox_owner_reference() fails closed against warm Sandboxes (they lack openshell.ai/sandbox-id), so there is no exploit in the install-only PR.
Workspace: SandboxTemplate.volumeClaimTemplates → each warm Sandbox gets its own Bound PVC (2 warm pods → 2 distinct PVCs, each owned by its Sandbox). Claiming replenished the pool with a new Sandbox + new PVC (claimed one not recycled). Deleting the claim deleted the Sandbox but left the workspace PVC Bound, still holding a written TENANT-A-SECRET marker: the shutdownPolicy default is Retain, confirming the orphaned-workspace data risk.
Baseline: the cold path is unaffected — sandbox create → Ready, IssueSandboxToken → minted JWT, echo executed inside the sandbox over the supervisor relay.
Checklist
I've reviewed existing issues and the architecture docs
This is a design proposal, not a "please build this" request