test(e2e-mw-dev): port the kind k8s suite into the real-cluster harness; retire tests/k8s#674

Merged: benben merged 1 commit into main from ben/e2e-mwdev-refactor-k8s on Jun 4, 2026

@benben benben commented Jun 4, 2026

What

The kind-based tests/k8s/ suite tested the multi-tenant activation pipeline against a fake cluster — it could not exercise the layers where this quarter's bugs actually lived (real Cilium, Crossplane ducklings, cnpg-shard + external-RDS metadata, per-org Lakekeeper). The per-PR mw-dev harness (tests/e2e-mw-dev/, merged in #657) can. This PR ports every cluster test into the shell harness, implements the remaining harness TODOs, and removes the kind suite + its CI job.

harness.sh — ported coverage (runs kubectl in-cluster via the Job's SA token)

| area | assertion | from |
| --- | --- | --- |
| wire/query | SELECT 1 + 5 concurrent distinct connections | TestK8sBasicQuery, TestK8sMultipleConcurrentConnections |
| activation | DuckLake and Iceberg attach + R/W, on cnpg AND ext | DuckLake/Iceberg round-trip tests |
| ext forks | bundled ducklake/httpfs are the PostHog forks, not upstream | *IsBundledFork |
| worker pods | labels, securityContext (non-root/uid1000/no-priv-esc), Downward-API POD_NAME/NODE_NAME, no SA-token mount | TestK8sWorkerPodCreation, *SecurityContext, *StampedWithPodAndNode, *DoNotMountServiceAccountToken |
| resilience | worker-pod kill crash recovery; DuckLake durability across worker restart; concurrent writers (fork conflict-retry, the test flaking on main) | *WorkerCrashRecovery, *DurabilityAcrossWorkerRestart, *ConcurrentWriters |
| isolation | cnpg vs ext see distinct catalogs; cross-tenant read denied | *DifferentTenantsSeeDistinctCatalogs |
| lifecycle | deprovision → wait Duckling CR --for=delete → re-provision the same org id → R/W again | the stranded-cnpg-role regression net (#649/#650/#11518/#11522) |
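
The wire/query row (several distinct connections issuing a query at once) could be sketched as a small shell helper. This is a minimal sketch, not the code in harness.sh: `run_concurrent` is a hypothetical name, and the helper only counts failed runs of whatever command it is given.

```shell
# Hypothetical helper: run the same command N times in parallel and
# report how many of the runs failed.
run_concurrent() {
  n=$1
  shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    # Each background job stands in for one distinct client connection.
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    # wait returns the job's exit status, so a failed run is counted.
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
  [ "$failed" -eq 0 ]
}
```

A call such as `run_concurrent 5 psql "$DSN" -Atc 'SELECT 1'` would cover the five-connection case, assuming the harness reaches the server with psql and a `DSN` variable (both are assumptions here, not taken from the PR).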

The lifecycle/recreate check is reliable now because it waits on the Crossplane Duckling CR's finalizer cascade (which drops the cnpg role+db) instead of racing on warehouse=deleted — the bug the old harness comment flagged as un-portable.
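
The wait described above might look roughly like the following. It is a sketch under stated assumptions: the function name, the `duckling/<name>` resource reference, the namespaced lookup, and the default timeout are all hypothetical. Only `kubectl wait --for=delete` itself comes from the PR text.

```shell
# Hypothetical sketch of the gate between deprovision and re-provision.
# KUBECTL can be overridden (for a dry run or a test stub).
wait_duckling_gone() {
  name=$1
  namespace=$2
  timeout=${3:-300s}
  # `kubectl wait --for=delete` blocks until the object is gone, which
  # only happens once its finalizers have completed.
  ${KUBECTL:-kubectl} wait --for=delete "duckling/$name" \
    --namespace "$namespace" --timeout="$timeout"
}
```

Gating re-provisioning on the function's exit status, rather than on a status field such as warehouse=deleted, is what removes the race: the CR cannot disappear before the cleanup behind its finalizers has run.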

TODOs resolved

  • durability / concurrency — implemented (table above).
  • same-org recreate — implemented with the CR --for=delete wait.
  • async-incomplete teardown — teardown + recreate are now CR-synchronous; drop_cnpg_role stays as an idempotent backstop.
  • e2e-cleanup (was "janitor") — run.sh e2e-cleanup + a 6h schedule trigger in e2e-mw-dev.yml reap stale duckgres-ci-pr-* namespaces (+ ducklings, cnpg role+db, Pod Identity association, cross-ns bindings). Renamed away from "janitor" to avoid colliding with duckgres's own control-plane janitor.
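
As an illustration of the namespace selection in e2e-cleanup, here is a minimal sketch that assumes "stale" means "older than a fixed age". The PR does not state the actual criterion, and `stale_namespaces` plus its input format are hypothetical.

```shell
# Hypothetical filter: read "<namespace> <creation-epoch-seconds>" lines on
# stdin and print the CI namespaces older than max_age seconds.
stale_namespaces() {
  now=$1
  max_age=$2
  awk -v now="$now" -v max_age="$max_age" '
    # Only per-PR CI namespaces are candidates; everything else is skipped.
    index($1, "duckgres-ci-pr-") == 1 && (now - $2) > max_age { print $1 }
  '
}
```

Keeping the selection a pure filter over text means it can be checked without a cluster. Producing the input (namespace names with creation times converted to epoch seconds) and deleting the selected namespaces and their associated resources are left out of the sketch.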

tests/k8s removal

  • Delete the kind Go suite + its testdata + CLAUDE.md. The cluster tests are ported; the remaining unit tests covered test-only kind-harness helpers (port-forward state machine, transient-DB/pod-gone detection, setup loader) that retire with the harness.
  • The RBAC + network-policy static-manifest asserts (the only unit tests over real shipped k8s/ config) move to tests/manifests/ and run in the normal go test ./... lane.
  • Remove the k8s-integration-tests job from ci.yml (and its iceberg OIDC/STS wiring — that path is now in the e2e harness).

Kept for now

The supporting k8s/ scripts/manifests + Dockerfile* stay (per request). The just test-k8s-integration recipe is now a dangling reference (its ./tests/k8s/... target is gone) — a later cleanup PR removes it.

Deliberately not ported (documented in README)

  • Shared-warm-worker activation + version-mismatch reaper — per-PR CP runs DUCKGRES_K8S_SHARED_WARM_TARGET=0 (no idle warm workers); stay covered by controlplane/ unit tests.
  • Physical object-store-prefix isolation — the Job holds no S3 list creds against real mw-dev S3; isolation asserted logically (cross-tenant read denied).
  • Cilium egress allow/deny probing — needs a stable in-worker exec probe (high flake); the policies are asserted statically in tests/manifests/.

Validation

  • go test ./tests/manifests/ ✅, gofmt clean, sh -n/bash -n on the scripts ✅ locally.
  • Full e2e validated by this PR's e2e-mw-dev run against real mw-dev.
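
The `sh -n`/`bash -n` check in the first bullet can be reproduced with a loop like this one. It is a sketch: `check_syntax` and the `SHELL_BIN` override are hypothetical names, not part of the repository.

```shell
# Hypothetical helper: parse each script without executing it and
# return non-zero if any of them has a syntax error.
check_syntax() {
  status=0
  for script in "$@"; do
    # -n makes the shell read and parse commands but not run them.
    if ! "${SHELL_BIN:-sh}" -n "$script" 2>/dev/null; then
      echo "syntax error: $script"
      status=1
    fi
  done
  return "$status"
}
```

For example, `SHELL_BIN=bash check_syntax harness.sh run.sh` would run the bash variant over the two scripts named in this PR (their exact paths are not given here, so adjust as needed).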

@benben benben force-pushed the ben/e2e-mwdev-refactor-k8s branch 4 times, most recently from d3d4a38 to d9b9c46, June 4, 2026 09:24
…ss; retire tests/k8s

The kind-based tests/k8s/ suite tested the multi-tenant activation pipeline
against a fake cluster — it could not exercise the layers where this quarter's
bugs actually lived (real Cilium, Crossplane ducklings, cnpg-shard + external
RDS metadata, per-org Lakekeeper). The per-PR mw-dev harness can. Port every
cluster test into harness.sh, implement the remaining TODOs, and remove the
kind suite + its CI job.

harness.sh (now runs kubectl in-cluster via the Job's SA token):
- wire/query: SELECT 1 + 5 concurrent distinct connections
- activation: DuckLake + Iceberg attach + R/W on cnpg AND ext backends
- extension forks: bundled ducklake/httpfs are the PostHog forks, not upstream
- worker pods: labels, securityContext (non-root/uid1000/no-priv-esc),
  Downward-API POD_NAME/NODE_NAME env, no ambient SA-token mount
- resilience: worker-pod kill crash recovery; DuckLake durability across a
  worker restart; concurrent writers (fork conflict-retry)
- isolation: cnpg vs ext see distinct catalogs, cross-tenant read denied
- lifecycle: deprovision -> wait Duckling CR --for=delete -> re-provision the
  SAME org id -> R/W again (the stranded-cnpg-role regression net, now reliable
  because it waits on the CR's finalizer cascade instead of warehouse=deleted)

run.sh: add a `janitor` subcommand; e2e-mw-dev.yml gains a 6h `schedule`
trigger that runs only the janitor (reaps stale duckgres-ci-pr-* namespaces +
their ducklings/cnpg-role/PIA/bindings). NAMESPACE no longer required for janitor.

tests/k8s removal:
- delete the kind Go suite + its testdata + CLAUDE.md (cluster tests ported;
  test-only harness helpers retired with it)
- the RBAC + network-policy static-manifest asserts (the only unit tests over
  real shipped k8s/ config) move to tests/manifests/ and run in `go test ./...`
- remove the k8s-integration-tests job from ci.yml

The supporting k8s/ scripts/manifests + Dockerfiles are kept for now (a later
cleanup PR removes the now-dangling `just test-k8s-integration` recipe).
Deliberately not ported: warm-pool activation + version-reaper (per-PR CP runs
warm-target=0), physical S3-prefix isolation (no list creds from the Job),
Cilium egress probing (needs a stable in-worker exec) — documented in README.
@benben benben force-pushed the ben/e2e-mwdev-refactor-k8s branch from d9b9c46 to 493ebac, June 4, 2026 09:39
@benben benben requested a review from a team June 4, 2026 09:46
@benben benben merged commit 296adbc into main Jun 4, 2026
24 checks passed
@benben benben deleted the ben/e2e-mwdev-refactor-k8s branch June 4, 2026 10:07