
feat: stand up sandbox cluster (Kustomize + Gateway API + push daemon)#61

Merged
themightychris merged 10 commits into
mainfrom
feat/sandbox-deploy
May 18, 2026

@themightychris
Member

Summary

Stands up the first working end-to-end deploy of the rewrite onto the CodeForPhilly sandbox cluster (Linode LKE, sandbox.k8s.phl.io). Live at https://next-v2.codeforphilly.org.

This is the operational follow-up to the deploy plan (#35) — the work needed to take that plan's containerization + chart skeleton and actually get a pod running, holding state across restarts, and serving authenticated users behind TLS. Closes #36 and #37.

What's new

Kustomize manifests (deploy/kustomize/) replacing the Helm chart-based plan: a base plus a sandbox overlay, per spec PR #60 (which leads this one). The Helm chart under deploy/charts/ stays in place for now and will be retired in a follow-up PR alongside the GHA workflow flip.

Gateway API instead of nginx Ingress. Cluster has Envoy Gateway (gatewayClassName: eg); the sandbox subdomain (*.sandbox.k8s.phl.io) CNAMEs to it. Cert-manager provisions TLS via letsencrypt-prod driven by an annotation on the Gateway. The previously-shipped Ingress is removed.

GitHub OAuth integration. Sandbox OAuth App registered under the CodeForPhilly org; credentials sealed into the namespace's env Secret alongside JWT signing key + data-remote URL. The signed-in user is now resolved from the in-memory state index (fastify.inMemoryState.people.get(id)) rather than a slow sheet-level scan that didn't reflect in-process writes.

Push daemon wired (closes #37). repo.startPushDaemon() is now called in API boot when CFP_DATA_REMOTE is set; emits push/retry/error logs and discriminates non-fast-forward (terminal) from transient failures. Fastify onClose stops the daemon on SIGTERM. The change rippled into a small but real return-type change for openPublicStore so the underlying Repository handle is reachable from the plugin layer.
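
The terminal-versus-transient split described above can be sketched roughly as follows. This is an illustration only: `classifyPushFailure`, `retryDelayMs`, and `nextRetry` are hypothetical names, not gitsheets or plugin APIs, and the backoff numbers are assumptions rather than the daemon's real policy. The one grounded detail is that git marks a rejected non-fast-forward push with `(non-fast-forward)` or `(fetch first)` in its stderr.

```typescript
// Hypothetical sketch: none of these names come from gitsheets or this PR.
type PushFailureKind = "non-fast-forward" | "transient";

// git reports a rejected non-fast-forward push with "(non-fast-forward)" or
// "(fetch first)" in stderr; everything else is treated here as transient.
function classifyPushFailure(stderr: string): PushFailureKind {
  return /\((non-fast-forward|fetch first)\)/.test(stderr)
    ? "non-fast-forward"
    : "transient";
}

// Capped exponential backoff (assumed values), used only for transient failures.
function retryDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Returns the delay before the next attempt, or null when retrying cannot help
// because the remote has commits the local branch lacks.
function nextRetry(stderr: string, attempt: number): number | null {
  return classifyPushFailure(stderr) === "non-fast-forward"
    ? null
    : retryDelayMs(attempt);
}
```

Treating non-fast-forward as terminal fits the reconciliation logic below: a plain retry cannot succeed until the local branch is rebased onto origin.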

Smart entrypoint reconciliation. Replaces git reset --hard origin/<branch> with state-aware logic:

  • in-sync → no-op
  • behind → fast-forward
  • ahead → push
  • diverged + clean rebase → rebase + push
  • diverged + conflicts → push a conflicts/<UTC-timestamp> branch to origin (loud for operators) and hard-reset local to known-good

Work is never silently dropped. Combined with the push daemon, a pod restart now reliably preserves data.
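
As a rough model of the state table above, the sketch below maps ahead/behind commit counts to a boot action. It assumes the counts come from `git rev-list --left-right --count HEAD...origin/<branch>`, which is one common way to obtain them; the PR does not say how the entrypoint detects each state, and the type and function names here are hypothetical.

```typescript
type SyncState = "in-sync" | "behind" | "ahead" | "diverged";
type BootAction =
  | "no-op"
  | "fast-forward"
  | "push"
  | "rebase-then-push"
  | "park-conflicts-then-reset";

// Parses the "<ahead>\t<behind>" output of
// `git rev-list --left-right --count HEAD...origin/<branch>` (assumed source).
function parseCounts(output: string): { ahead: number; behind: number } {
  const parts = output.trim().split(/\s+/).map(Number);
  const ahead = parts[0] ?? NaN;
  const behind = parts[1] ?? NaN;
  if (!Number.isInteger(ahead) || !Number.isInteger(behind)) {
    throw new Error(`unexpected rev-list output: ${JSON.stringify(output)}`);
  }
  return { ahead, behind };
}

// Local-only commits make the branch "ahead"; origin-only commits make it "behind".
function classifySync(ahead: number, behind: number): SyncState {
  if (ahead === 0 && behind === 0) return "in-sync";
  if (ahead === 0) return "behind";
  if (behind === 0) return "ahead";
  return "diverged";
}

// rebaseIsClean only matters when the branches have diverged.
function planBootAction(state: SyncState, rebaseIsClean: boolean): BootAction {
  switch (state) {
    case "in-sync":
      return "no-op";
    case "behind":
      return "fast-forward";
    case "ahead":
      return "push";
    case "diverged":
      return rebaseIsClean ? "rebase-then-push" : "park-conflicts-then-reset";
  }
}
```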

Iteration polish. GHCR package renamed to match the repo (codeforphilly-ng); imagePullPolicy: Always on the mutable :sandbox tag; --platform=linux/amd64 note in the runbook for Apple Silicon builds; entrypoint hardened against PVC residue and dubious-ownership uid mismatches.

Verification

Stand-up validated end-to-end on the live cluster:

  • TLS issued by letsencrypt-prod (next-v2.codeforphilly.org)
  • API reachable, both stores loaded, FTS ready
  • GitHub OAuth flow completes, session cookie set, SPA recognizes the user
  • Commit pushed to origin/fixture ~1s after sign-in (verified in pod logs)
  • All 213 unit tests pass; tsc --noEmit clean on apps/api

What's deliberately deferred

  • Helm chart at deploy/charts/codeforphilly/ — kept for now; retired in a follow-up alongside .github/workflows/deploy-{staging,production}.yml flipping from helm to kubectl apply -k.
  • The sandbox SSH deploy key was initially registered read-only; it was switched to read-write in the GitHub UI (the public-key value is unchanged). Documenting this in the runbook for fresh-cluster setups is deferred to the next bring-up.

Test plan

  • CI green
  • Browser-level sign-in flow on next-v2.codeforphilly.org still works after merge (no namespace/host changes; just code merging)
  • Pod restart preserves logged-in users (validates the daemon + entrypoint loop)

🤖 Generated with Claude Code

themightychris and others added 10 commits May 17, 2026 11:22
Iterating on a manual deploy to the CfP sandbox cluster (Linode LKE).
Replaces the Helm chart with a kustomize base + sandbox overlay per the
direction in the parked specs/architecture.md spec amendment.

What landed:

- `deploy/kustomize/base/` — namespace-agnostic manifests:
  Deployment (single replica, Recreate strategy, non-root, readiness +
  liveness probes), Service, two PVCs (linode-block-storage-retain),
  ConfigMap with non-secret env, ServiceAccount, Ingress (nginx,
  letsencrypt-staging issuer, TLS to a per-overlay host).

- `deploy/kustomize/overlays/sandbox/` — sandbox-specific Namespace,
  ingress host patch (codeforphilly-rewrite.codeforphilly.sandbox.k8s.phl.io),
  and SealedSecret manifests for:
  - codeforphilly-secrets (CFP_JWT_SIGNING_KEY, CFP_DATA_REMOTE)
  - codeforphilly-data-deploy-key (read-only ed25519 SSH key registered
    as a deploy key on the data repo)

- `docs/operations/sandbox-deploy.md` — manual procedure, rotation steps,
  image-visibility note, branch-switching.

Build-system fixes pulled in along the way:

- `packages/shared` was `noEmit: true`, so the api's compiled JS that
  imports from `@cfp/shared/schemas` couldn't be resolved at runtime in a
  production image. Switched the package to emit to dist/, updated its
  exports map, set its package.json `main`/`types` to dist.

- Root `build` script now enforces shared → api → web order (npm workspaces
  doesn't topo-sort by default). The previous `--workspaces --if-present`
  ran api before shared and exploded.

- `Dockerfile`: dropped the per-workspace `node_modules` COPYs — npm
  workspaces hoists every dep to the root, so those paths didn't exist.
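
The root `build` script hard-codes the shared → api → web order. For illustration, the sketch below shows how that order could instead be derived from a dependency map with a depth-first topological sort. The map is an assumption: the commit only establishes that api needs shared to be built first, and the web entry is a guess made so the example reproduces the stated order.

```typescript
// Depth-first topological sort: each workspace is emitted after its dependencies.
function buildOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle at ${name}`);
    }
    state.set(name, "visiting");
    for (const dep of deps[name] ?? []) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  for (const name of Object.keys(deps)) visit(name);
  return order;
}

// Hypothetical dependency map, listed in an order that would break a naive
// "run every workspace in declaration order" build.
const workspaceDeps: Record<string, string[]> = {
  api: ["shared"],
  web: ["shared"],
  shared: [],
};

const order = buildOrder(workspaceDeps); // shared first, then api, then web
```

Hard-coding three names in a script is simpler and arguably the right call at this size; a derived order only starts to pay off as the workspace count grows.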

Cluster state after `kubectl apply -k deploy/kustomize/overlays/sandbox`:
- Both SealedSecrets decrypted into Secrets
- Both PVCs bound (Linode block storage)
- Ingress provisioned on the existing LB (45.79.246.168)
- Deployment sitting in ImagePullBackOff waiting for the image push

Blocked on user-side steps:
- `gh auth refresh -s write:packages` so docker push to GHCR succeeds
- Make the ghcr.io/codeforphilly/codeforphilly-rewrite package public
  (one-time, on the package settings page)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-overlay staging override is documented in the file's comment; sandbox now
issues real TLS certs from letsencrypt-prod (rate limit: 50 certs/week per
registered domain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the GitHub repo name (CodeForPhilly/codeforphilly-ng). Kustomize image
mappings + docs/operations/deploy.md updated; deployment.yaml itself + the
sandbox-deploy runbook follow in the next commits along with the
sandbox-iteration fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes discovered standing up the first end-to-end sandbox deploy:

- **Entrypoint self-heals PVC residue.** Block-storage PVCs survive pod
  restarts and can carry non-empty/non-git content from earlier iterations.
  The entrypoint now wipes a non-empty data dir before re-cloning and adds
  the data path to git's safe.directory list (uid mismatch is common when
  prior pods ran as root and the new pod runs as uid 1000).

- **imagePullPolicy: Always for the sandbox image.** The `:sandbox` tag is
  mutable (re-pushed on each iteration). With IfNotPresent, the kubelet would
  keep using an image it had already cached under that tag, even after a
  newer digest was pushed. Production overlays should pin to a digest and set
  this back to IfNotPresent.

- **`docs/operations/sandbox-deploy.md`** + image rename to codeforphilly-ng.
  Doc now notes `--platform=linux/amd64` is required on Apple Silicon —
  Linode LKE nodes are amd64 and won't pull an arm64-only manifest.
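
The PVC self-healing decision in the first fix can be modelled roughly as below. This is a sketch under stated assumptions: the real entrypoint is a shell script, the function names are hypothetical, and treating a directory that already contains `.git` as reusable is an inference from "non-empty/non-git content", not something the commit spells out. The `safe.directory` arguments are the standard git config invocation.

```typescript
type DataDirAction = "clone" | "reuse" | "wipe-then-clone";

// entries: names found directly inside the data directory on the PVC.
function planDataDir(entries: readonly string[]): DataDirAction {
  if (entries.length === 0) return "clone"; // fresh volume
  if (entries.includes(".git")) return "reuse"; // assumed: existing clone is kept
  return "wipe-then-clone"; // residue from an earlier iteration
}

// Arguments for `git` that mark the data path as trusted, so git does not
// refuse to operate when the directory owner differs from the current uid.
function safeDirectoryArgs(dataPath: string): string[] {
  return ["config", "--global", "--add", "safe.directory", dataPath];
}
```

For example, a freshly formatted block volume that contains only `lost+found` would be wiped and re-cloned under this model.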

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster has Envoy Gateway (gatewayClassName: eg) at 139.144.241.4; DNS for
*.sandbox.k8s.phl.io and next-v2.codeforphilly.org CNAMEs into it. The nginx
Ingress lived at a separate IP (45.79.246.168) that nothing currently points
at, so it was dead weight.

- Add Gateway + HTTPRoute to kustomize/base; cert-manager annotation on the
  Gateway provisions the codeforphilly-gw-tls Secret via letsencrypt-prod.
- Drop kustomize/base/ingress.yaml.
- Sandbox overlay patches both resources to next-v2.codeforphilly.org.
- docs/operations/sandbox-deploy.md updated to reflect the new ingress path
  and the production hostname.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GITHUB_OAUTH_CLIENT_ID + GITHUB_OAUTH_CLIENT_SECRET alongside the
existing CFP_JWT_SIGNING_KEY + CFP_DATA_REMOTE. The sandbox OAuth App
"Code for Philly (sandbox)" is registered under the CodeForPhilly org with
callback https://next-v2.codeforphilly.org/api/auth/github/callback.

Verified end-to-end: GET /api/auth/github/start 302s to github.com/login/oauth/authorize
with the correct client_id, redirect_uri, scopes (read:user user:email), and
PKCE code_challenge.
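
For readers unfamiliar with PKCE, the sketch below shows how a `code_challenge` is derived from a verifier (RFC 7636, S256) and how an authorize URL with the parameters listed above might be assembled. It is not the API's actual code: the function names are hypothetical, and `code_challenge_method=S256` is assumed since the commit only mentions the challenge itself.

```typescript
import { createHash, randomBytes } from "node:crypto";

// RFC 7636 S256: challenge = BASE64URL(SHA-256(ASCII(verifier))), unpadded.
function codeChallenge(verifier: string): string {
  return createHash("sha256").update(verifier, "ascii").digest("base64url");
}

// 32 random bytes encode to a 43-character URL-safe verifier.
function newVerifier(): string {
  return randomBytes(32).toString("base64url");
}

// Assembles the redirect target for the start of the OAuth flow.
function authorizeUrl(
  clientId: string,
  redirectUri: string,
  verifier: string,
): string {
  const url = new URL("https://github.com/login/oauth/authorize");
  url.searchParams.set("client_id", clientId);
  url.searchParams.set("redirect_uri", redirectUri);
  url.searchParams.set("scope", "read:user user:email");
  url.searchParams.set("code_challenge", codeChallenge(verifier));
  url.searchParams.set("code_challenge_method", "S256"); // assumed method
  return url.toString();
}
```

The verifier stays server-side (or in the session) and is sent only in the later token exchange, so an intercepted authorization code is useless without it.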

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The session middleware (and a couple of sibling lookups in the auth and
SAML routes) was reaching for the Person via `store.public.people.queryFirst({ id })`,
which doesn't reflect commits made in the same process between transact and
the next sheet refresh. Newly-created users (the create-fresh OAuth path)
got a valid session JWT but the middleware's lookup returned null, so the
SPA's /api/auth/me saw `accountLevel: user` with `person: null` and rendered
as not-signed-in.

Switched all three call sites to `fastify.inMemoryState.people.get(id)` —
the canonical fast index already used by every other id→Person lookup in
the codebase (projects-members, projects-buzz). Same source of truth that
`stateApply.apply()` writes to immediately after the gitsheets transact.
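
A toy model of the bug, for illustration: a reader that only sees the last refreshed snapshot misses a person written in the same process, while an index updated right after the write does not. `SnapshotReader` and `ToyStore` are hypothetical stand-ins and do not reflect the gitsheets or `inMemoryState` implementations.

```typescript
interface Person {
  id: string;
  name: string;
}

// Stand-in for a sheet query that only reflects the most recent refresh.
class SnapshotReader {
  private snapshot: Person[] = [];

  refresh(committed: readonly Person[]): void {
    this.snapshot = [...committed];
  }

  queryFirst(id: string): Person | null {
    return this.snapshot.find((p) => p.id === id) ?? null;
  }
}

class ToyStore {
  readonly committed: Person[] = [];
  readonly index = new Map<string, Person>();
  readonly reader = new SnapshotReader();

  // The write lands in the committed log and the index is updated immediately;
  // the snapshot reader stays stale until its next refresh.
  transact(person: Person): void {
    this.committed.push(person);
    this.index.set(person.id, person);
  }
}

const store = new ToyStore();
store.transact({ id: "p1", name: "New OAuth user" });

const viaSnapshot = store.reader.queryFirst("p1"); // null until refresh
const viaIndex = store.index.get("p1"); // found immediately
```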

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deploy plan shipped the deploy-key mount + GIT_SSH_COMMAND wiring but
never actually called `repo.startPushDaemon()`, so every commit produced by
store.transact landed on the local branch and was wiped the next time the
entrypoint refreshed against origin. Without this, accounts created via
OAuth survived only until the next pod restart.

- `openPublicStore` now returns `{ store, repo }` so the booted Repository
  handle is reachable from the plugin layer
- `bootStores` plumbs the repo through as `publicRepo`
- New `apps/api/src/plugins/push-daemon.ts` starts the daemon when
  `CFP_DATA_REMOTE` is set; emits push/retry/error logs; discriminates
  non-fast-forward (terminal) from transient (retried) failures per
  gitsheets 1.0.5+
- Fastify `onClose` hook stops the daemon on SIGTERM
- `fastify.pushDaemon` decoration (typed `PushDaemon | null`) lets future
  admin/status routes surface daemon state
- `CFP_DATA_BRANCH` lifted into the API env schema so the daemon pushes
  to the right branch (it was previously only consumed by the entrypoint)
- Scripts/tests using `openPublicStore` updated to destructure the new
  return shape and reference the exported `PublicStore` type directly

Verified: boot logs show `push-daemon started remote=origin branch=fixture`
on the running sandbox pod. The follow-up problem of the entrypoint's
unconditional `git reset --hard origin/<branch>` discarding local commits
between pod-terminate and the daemon's next push tick is separate and
will be addressed in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous entrypoint did `git fetch && git reset --hard origin/<branch>`
on every boot, which discarded any commits the API made locally that the
push daemon hadn't yet pushed (the structural cause of the post-OAuth
account-loss bug). With the push daemon now wired (5c98144 / #37), the
entrypoint can be smarter:

Boot-time states + actions, in order of cheapness:

- in sync with origin       → no-op
- behind origin             → `git merge --ff-only`
- ahead of origin           → push (push-daemon retries on failure)
- diverged + clean rebase   → rebase onto origin, then push
- diverged + conflicting    → escape hatch:
    1. abort rebase
    2. create `conflicts/<UTC-timestamp>` branch at the pre-rebase HEAD
    3. push that branch to origin (loudly logged for operators)
    4. hard-reset local to origin so the pod boots from known-good state
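
The four escape-hatch steps could be expressed as a command plan along these lines. This is a sketch, not the entrypoint itself (which is a shell script): the function names are hypothetical and the compact timestamp format is an assumption, since the commit only specifies `conflicts/<UTC-timestamp>`. The git subcommands used are standard.

```typescript
// Compact UTC stamp such as 20260518T151900Z; the exact format is assumed.
// Colons are stripped because they are not allowed in git ref names.
function conflictBranchName(now: Date): string {
  const stamp = now
    .toISOString()
    .replace(/\.\d{3}Z$/, "Z")
    .replace(/[-:]/g, "");
  return `conflicts/${stamp}`;
}

// Returns the git invocations for the conflicting-divergence case, in order.
function escapeHatchPlan(
  branch: string,
  preRebaseHead: string,
  now: Date,
): string[][] {
  const parked = conflictBranchName(now);
  return [
    ["git", "rebase", "--abort"], // 1. abandon the conflicted rebase
    ["git", "branch", parked, preRebaseHead], // 2. park the local work
    ["git", "push", "origin", parked], // 3. make it visible to operators
    ["git", "reset", "--hard", `origin/${branch}`], // 4. boot from known-good
  ];
}
```

Parking the branch before the hard reset is what keeps the reset from being destructive: the pre-rebase commits remain reachable from a named ref on origin.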

Safety properties:

- Work is never silently dropped; a conflict produces a named branch.
- Fetch failures (network blips) are non-fatal; entrypoint falls through
  to API start with whatever's locally available.
- Push failures during reconciliation are non-fatal; the running push
  daemon retries.
- Drops `--depth=1` from initial clone — rebase needs the merge-base.
  Existing shallow PVCs are unshallowed on first boot of this image.

Pseudonymous identity (api@users.noreply.codeforphilly.org) is used for the
entrypoint's own git operations; commit authors are preserved on rebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`packages/shared` exports compiled output from `dist/` (since the Docker
build fix in f805a1a). Tests, type-check, and lint resolve `@cfp/shared`
via the exports map, which means `dist/` must exist before any of them
run — but CI was running type-check before build. Add an explicit
`npm run -w packages/shared build` step right after `npm ci`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@themightychris themightychris merged commit 930d035 into main May 18, 2026
1 check passed
@themightychris themightychris deleted the feat/sandbox-deploy branch May 18, 2026 15:19

Development

Successfully merging this pull request may close these issues.

  • Wire startPushDaemon() in API boot so commits actually propagate to the data remote
  • deploy: stand up staging cluster + bucket and verify end-to-end
