# Warm-pooled sandboxes for the Kubernetes compute driver (#1879)

## Problem Statement

Creating a Kubernetes sandbox is a cold start: the gateway creates a `Sandbox` CR, the agent-sandbox controller schedules a Pod, the image is pulled (or read from cache), the supervisor boots, and only then is the sandbox `Ready`. Measured locally, this takes roughly 4s or more even with the image preloaded. For interactive agent workloads and high-churn "fresh sandbox per task" usage, that latency dominates time-to-first-action.

We want near-instant Kubernetes sandbox provisioning via a **warm pool** of pre-provisioned, ready Pods.

This issue scopes the design (moved here from PR #1813 per the issue-first RFC process). It supersedes the draft RFC and incorporates the security review on that PR.

## Proposed Design

Adopt the upstream agent-sandbox **warm-pool extension CRDs** — `SandboxTemplate`, `SandboxWarmPool`, `SandboxClaim` (`extensions.agents.x-k8s.io/v1alpha1`) — already shipped in the `v0.4.6` release OpenShell pins for the core `Sandbox` CRD. The gateway pre-declares **operator-owned** warm pools; on `CreateSandbox`, when the requested shape matches a pool, the Kubernetes driver creates a `SandboxClaim` that binds a pre-warmed Pod in ~0.1s. Non-matching requests fall back to the existing cold path.
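The match-or-fall-back decision described above can be sketched as follows. This is a minimal illustration, not the driver's actual code: `SandboxShape`, `ProvisionPath`, and `select_path` are hypothetical names, and the set of fields that make up a "shape" is an assumption (the real driver may compare more fields or a template hash).

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified "shape" of a sandbox request.
/// Assumption: the real comparison may cover more fields than these.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SandboxShape {
    pub image: String,
    pub runtime_class: Option<String>,
    pub cpu_millis: u32,
    pub memory_mib: u32,
}

/// How the driver should provision a sandbox.
#[derive(Debug, PartialEq, Eq)]
pub enum ProvisionPath {
    /// Create a `SandboxClaim` against the named operator-owned pool.
    Warm { pool: String },
    /// Create a `Sandbox` CR directly (the existing cold path).
    Cold,
}

/// Choose the warm path only on an exact shape match with a declared pool;
/// every other request falls back to the cold path.
pub fn select_path(
    pools: &BTreeMap<String, SandboxShape>,
    requested: &SandboxShape,
) -> ProvisionPath {
    pools
        .iter()
        .find(|(_, shape)| *shape == requested)
        .map(|(name, _)| ProvisionPath::Warm { pool: name.clone() })
        .unwrap_or(ProvisionPath::Cold)
}
```

Exact equality is deliberately conservative here: a request that differs in any field takes the cold path rather than receiving a Pod it did not ask for.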

### Identity re-anchor (security-critical)

Today `validate_sandbox_owner_reference()` (`crates/openshell-server/src/auth/k8s_sa.rs`) cross-checks the owning `Sandbox` CR label `openshell.ai/sandbox-id` against the Pod annotation `openshell.io/sandbox-id`. Warm Sandboxes are created generically by the pool controller and carry `agents.x-k8s.io/claim-uid` + a controlling `SandboxClaim` ownerReference instead, so identity must re-anchor to the **gateway-created `SandboxClaim`**. The `IssueSandboxToken` warm path must enforce a strict fail-closed chain, with any mismatch rejecting:

- TokenReview audience/SA/pod-name/pod-UID match the live Pod.
- Pod has exactly one controlling `Sandbox` ownerReference (matching UID).
- That `Sandbox` has exactly one controlling `SandboxClaim` ownerReference (matching name + UID); `agents.x-k8s.io/claim-uid` agrees.
- The live `SandboxClaim` exists, has the expected UID, is bound, and `status.sandbox.name` equals the owning `Sandbox`.
- The gateway **Store** has a durable, gateway-created record for `(namespace, claim name, claim UID)` containing the expected sandbox ID, and it equals the Pod's `openshell.io/sandbox-id` annotation.
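The chain above could be expressed as a single function that returns the sandbox ID only when every link holds. The sketch below assumes the live objects and the Store record have already been fetched into plain structs; every type, field, and function name is hypothetical, and audience/ServiceAccount validation is collapsed into one boolean for brevity.

```rust
/// Simplified, hypothetical views of the live objects and the Store record.
#[derive(Clone)]
pub struct OwnerRef { pub kind: String, pub name: String, pub uid: String, pub controller: bool }
#[derive(Clone)]
pub struct TokenReviewView { pub audience_and_sa_ok: bool, pub pod_name: String, pub pod_uid: String }
#[derive(Clone)]
pub struct PodView { pub name: String, pub uid: String, pub sandbox_id: Option<String>, pub owners: Vec<OwnerRef> }
#[derive(Clone)]
pub struct SandboxView { pub name: String, pub uid: String, pub claim_uid: Option<String>, pub owners: Vec<OwnerRef> }
#[derive(Clone)]
pub struct ClaimView { pub namespace: String, pub name: String, pub uid: String, pub bound_sandbox: Option<String> }
#[derive(Clone)]
pub struct StoreRecord { pub namespace: String, pub claim_name: String, pub claim_uid: String, pub sandbox_id: String }

/// Which link of the chain failed.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject { TokenReview, PodOwner, SandboxOwner, Claim, Store }

/// Returns the owner reference only if exactly one controlling reference
/// of the given kind exists.
fn sole_controller<'a>(owners: &'a [OwnerRef], kind: &str) -> Option<&'a OwnerRef> {
    let mut it = owners.iter().filter(|o| o.controller && o.kind == kind);
    match (it.next(), it.next()) {
        (Some(only), None) => Some(only),
        _ => None,
    }
}

/// Fail-closed: any mismatch rejects; only a full match yields the sandbox ID.
pub fn verify_warm_identity(
    review: &TokenReviewView,
    pod: &PodView,
    sandbox: &SandboxView,
    claim: &ClaimView,
    record: Option<&StoreRecord>,
) -> Result<String, Reject> {
    // 1. The TokenReview must describe the live Pod.
    if !review.audience_and_sa_ok || review.pod_name != pod.name || review.pod_uid != pod.uid {
        return Err(Reject::TokenReview);
    }
    // 2. Pod -> exactly one controlling Sandbox, matching name and UID.
    let sb = sole_controller(&pod.owners, "Sandbox").ok_or(Reject::PodOwner)?;
    if sb.name != sandbox.name || sb.uid != sandbox.uid {
        return Err(Reject::PodOwner);
    }
    // 3. Sandbox -> exactly one controlling SandboxClaim; claim-uid agrees.
    let cl = sole_controller(&sandbox.owners, "SandboxClaim").ok_or(Reject::SandboxOwner)?;
    if cl.name != claim.name
        || cl.uid != claim.uid
        || sandbox.claim_uid.as_deref() != Some(claim.uid.as_str())
    {
        return Err(Reject::SandboxOwner);
    }
    // 4. The live claim is bound to the owning Sandbox.
    if claim.bound_sandbox.as_deref() != Some(sandbox.name.as_str()) {
        return Err(Reject::Claim);
    }
    // 5. The durable Store record matches the claim and the Pod annotation.
    let rec = record.ok_or(Reject::Store)?;
    if rec.namespace != claim.namespace
        || rec.claim_name != claim.name
        || rec.claim_uid != claim.uid
        || pod.sandbox_id.as_deref() != Some(rec.sandbox_id.as_str())
    {
        return Err(Reject::Store);
    }
    Ok(rec.sandbox_id.clone())
}
```

Returning the sandbox ID from the Store record, rather than from the Pod annotation, keeps the gateway-created record as the source of truth; the annotation is only ever compared against it.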

The mapping must live in the **shared gateway Store** (HA-safe; any replica may handle bootstrap), and the claim record must be written before the Pod can bootstrap. Users must **not** be able to set reserved metadata keys (`openshell.io/*`, `agents.x-k8s.io/*`, identity/SPIFFE keys) or to supply the claim fields `SandboxClaim.spec.additionalPodMetadata` and `spec.env`.
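One way to enforce the reserved-metadata rule is to validate user-supplied labels, annotations, and env before a claim is ever built. The sketch below is illustrative only: `UserMetadata`, `MetadataError`, and `validate_user_metadata` are hypothetical names, only the two prefixes named in this issue are checked (identity/SPIFFE keys are deployment-specific and omitted), and whether a violation rejects the request or routes it to the cold path is left to the caller.

```rust
/// Key prefixes named as reserved in this design.
/// Assumption: identity/SPIFFE keys would be added per deployment.
const RESERVED_PREFIXES: [&str; 2] = ["openshell.io/", "agents.x-k8s.io/"];

/// Hypothetical user-facing request fields relevant to the warm path.
#[derive(Default)]
pub struct UserMetadata {
    pub labels: Vec<(String, String)>,
    pub annotations: Vec<(String, String)>,
    pub env: Vec<(String, String)>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MetadataError {
    /// A label or annotation key falls under a reserved prefix.
    ReservedKey(String),
    /// Per-claim env is not accepted on the warm path.
    EnvNotAllowed,
}

/// Reject any user-supplied key under a reserved prefix, and any env.
pub fn validate_user_metadata(meta: &UserMetadata) -> Result<(), MetadataError> {
    for (key, _) in meta.labels.iter().chain(meta.annotations.iter()) {
        if RESERVED_PREFIXES.iter().any(|p| key.starts_with(*p)) {
            return Err(MetadataError::ReservedKey(key.clone()));
        }
    }
    if !meta.env.is_empty() {
        return Err(MetadataError::EnvNotAllowed);
    }
    Ok(())
}
```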

### Workspace isolation (single-use) — the key design constraint

OpenShell's `/sandbox` is a PVC seeded once by an init container and preserved for the sandbox lifetime. Live testing (see Agent Investigation) shows that each warm `Sandbox` gets its **own** PVC (no sharing) and that the pool replenishes with **fresh** Sandboxes (a claimed Sandbox is not recycled). **However**, under the upstream default `SandboxClaim.spec.lifecycle.shutdownPolicy: Retain`, deleting the claim deletes the `Sandbox`/Pod but **leaves the workspace PVC orphaned and intact** (verified: a written marker survived). This design must enforce the following invariants:

- Claimed `Sandbox`/Pod/PVC are **single-use**: consumed once, then **destroyed** — never returned to the pool.
- OpenShell must **explicitly destroy the workspace PVC** on teardown (set `shutdownPolicy: Delete`/`DeleteForeground` and/or owner-ref/finalizer-driven PVC deletion). The `Retain` default is unsafe and must not be used unless PVC reuse is provably impossible.
- Each warm `Sandbox` uses per-Sandbox `volumeClaimTemplates` (no shared workspace volume).
- A warm Pod is seeded **pristine from the image** before it is claimable; it must never run user workload code while pooled.
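The single-use invariants can be read as a small state machine with no edge back to the pool. The sketch below is a hypothetical model, not gateway code: `State`, `Event`, and `step` are illustrative names, and it is stricter than the text above in that it always refuses `Retain` rather than allowing it when PVC reuse is provably impossible.

```rust
/// Mirrors the `shutdownPolicy` values mentioned in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownPolicy { Retain, Delete, DeleteForeground }

/// Hypothetical lifecycle states of a warm Sandbox/Pod/PVC triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State { Pooled, Claimed, Destroyed }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event { Claim, ReturnToPool, Teardown(ShutdownPolicy) }

#[derive(Debug, PartialEq, Eq)]
pub enum LifecycleError {
    /// A Sandbox is never handed back to the pool.
    NoRecycling,
    /// `Retain` would leave the workspace PVC behind.
    WorkspaceRetained,
    InvalidTransition,
}

/// Single-use transitions: Pooled -> Claimed -> Destroyed, nothing else.
pub fn step(state: State, event: Event) -> Result<State, LifecycleError> {
    match (state, event) {
        (State::Pooled, Event::Claim) => Ok(State::Claimed),
        (_, Event::ReturnToPool) => Err(LifecycleError::NoRecycling),
        // Simplification: this sketch refuses `Retain` unconditionally.
        (State::Claimed, Event::Teardown(ShutdownPolicy::Retain)) => {
            Err(LifecycleError::WorkspaceRetained)
        }
        (State::Claimed, Event::Teardown(_)) => Ok(State::Destroyed),
        _ => Err(LifecycleError::InvalidTransition),
    }
}
```

Modeling `ReturnToPool` as an explicit, always-rejected event makes the "never recycled" rule something a test can assert on directly, rather than an absence of code.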

Single-use still preserves the latency win: a warm Pod is pre-scheduled, image-pulled, supervisor-booted, and workspace-seeded before claim (~4s → ~0.1s measured).

### Scope guardrails

Initially, only **operator-declared pools using trusted templates/images** are warm-pooled; user-supplied images or arbitrary per-request templates fall back to the cold path until per-tenant pool isolation and cleanup guarantees are designed.

### What bakes vs. late-binds

- **Baked into the shared `SandboxTemplate`:** image, mTLS mount, projected SA-token volume, supervisor sideload, capabilities, host aliases, runtimeClass, resources, workspace VCT.
- **Injected per-claim (annotation only):** `openshell.io/sandbox-id` (per-claim `env` is rejected on the warm path).
- **Late-bound over the supervisor relay (already works):** policy, providers. Identity is established by the existing token exchange, not Pod env.

### Phased plan

1. **Settle this design** (this issue).
2. **Driver warm path (flagged):** create `SandboxClaim` instead of `Sandbox` for pooled shapes; gateway RBAC for `extensions.agents.x-k8s.io`; durable Store claim mapping. Install `extensions.yaml` in dev/e2e **alongside this consumer** (not before).
3. **Auth re-anchor:** implement the fail-closed chain in `k8s_sa.rs` + adversarial tests.
4. **Single-use lifecycle:** explicit claim/PVC destruction; workspace-isolation e2e (claim → write secret → delete → re-claim → assert clean + PVC not reused).
5. **Pool management + surface/docs.**

## Alternatives Considered

- **Patch identity onto the claimed Pod after bind** (keep the label cross-check): requires granting the gateway `patch pods` (deliberately denied for immutability) and is racy.
- **Bare-Pod warm pools** (if upstream pools created Pods rather than `Sandbox` CRs; see upstream issue #390): would break the ownerReference auth chain. This does not apply today, since v0.4.6 creates `Sandbox` CRs.
- **Do nothing:** accept cold-start latency. Viable for low-churn usage, poor for interactive agents.

## Agent Investigation

Validated on a local k3s (k3d) cluster with agent-sandbox `v0.4.6` (core + extensions):

- **Identity:** claim binds in ~0.13s; the warm Pod is owned by a controlling `Sandbox` CR (ownerRef chain intact); the claim-injected `openshell.io/sandbox-id` annotation lands on the Pod; per-claim `env` is rejected on the warm path. The bound `Sandbox` carries `agents.x-k8s.io/claim-uid` + a controlling `SandboxClaim` ownerRef. The current `validate_sandbox_owner_reference()` **fails closed** against warm Sandboxes (they lack `openshell.ai/sandbox-id`), so there is no exploit in the install-only PR.
- **Workspace:** `SandboxTemplate.volumeClaimTemplates` → each warm `Sandbox` gets its **own** `Bound` PVC (2 warm pods → 2 distinct PVCs, each owned by its `Sandbox`). Claiming replenished the pool with a **new** `Sandbox` + **new** PVC (claimed one not recycled). Deleting the claim deleted the `Sandbox` but **left the workspace PVC `Bound`** holding a written `TENANT-A-SECRET` marker — `shutdownPolicy` default is `Retain`, confirming the orphaned-workspace data risk.
- **Baseline:** the cold path is unaffected — `sandbox create` → `Ready`, `IssueSandboxToken` → minted JWT, `echo` executed inside the sandbox over the supervisor relay.

## Checklist

- [x] I've reviewed existing issues and the architecture docs
- [x] This is a design proposal, not a "please build this" request

