feat: stand up sandbox cluster (Kustomize + Gateway API + push daemon) #61
Merged
Iterating on a manual deploy to the CfP sandbox cluster (Linode LKE).
Replaces the Helm chart with a kustomize base + sandbox overlay per the
direction in the parked specs/architecture.md spec amendment.
What landed:
- `deploy/kustomize/base/` — namespace-agnostic manifests:
Deployment (single replica, Recreate strategy, non-root, readiness +
liveness probes), Service, two PVCs (linode-block-storage-retain),
ConfigMap with non-secret env, ServiceAccount, Ingress (nginx,
letsencrypt-staging issuer, TLS to a per-overlay host).
- `deploy/kustomize/overlays/sandbox/` — sandbox-specific Namespace,
ingress host patch (codeforphilly-rewrite.codeforphilly.sandbox.k8s.phl.io),
and SealedSecret manifests for:
- codeforphilly-secrets (CFP_JWT_SIGNING_KEY, CFP_DATA_REMOTE)
- codeforphilly-data-deploy-key (read-only ed25519 SSH key registered
as a deploy key on the data repo)
- `docs/operations/sandbox-deploy.md` — manual procedure, rotation steps,
image-visibility note, branch-switching.
Build-system fixes pulled in along the way:
- `packages/shared` was `noEmit: true`, so the api's compiled JS that
imports from `@cfp/shared/schemas` couldn't be resolved at runtime in a
production image. Switched the package to emit to dist/, updated its
exports map, set its package.json `main`/`types` to dist.
- Root `build` script now enforces shared → api → web order (npm workspaces
doesn't topo-sort by default). The previous `--workspaces --if-present`
ran api before shared and exploded.
- `Dockerfile`: dropped the per-workspace `node_modules` COPYs — npm
workspaces hoists every dep to the root, so those paths didn't exist.
Cluster state after `kubectl apply -k deploy/kustomize/overlays/sandbox`:
- Both SealedSecrets decrypted into Secrets
- Both PVCs bound (Linode block storage)
- Ingress provisioned on the existing LB (45.79.246.168)
- Deployment sitting in ImagePullBackOff waiting for the image push
Blocked on user-side scope refresh:
- `gh auth refresh -s write:packages` so docker push to GHCR succeeds
- Make the ghcr.io/codeforphilly/codeforphilly-rewrite package public
(one-time, on the package settings page)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-overlay staging override is documented in the file's comment; sandbox
now issues real TLS certs from letsencrypt-prod (rate limit: 50 certs/week
per registered domain).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match the GitHub repo name (CodeForPhilly/codeforphilly-ng). Kustomize image
mappings + docs/operations/deploy.md updated; deployment.yaml itself + the
sandbox-deploy runbook follow in the next commits along with the
sandbox-iteration fixes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes discovered standing up the first end-to-end sandbox deploy:
- **Entrypoint self-heals PVC residue.** Block-storage PVCs survive pod
restarts and can carry non-empty/non-git content from earlier iterations.
The entrypoint now wipes a non-empty data dir before re-cloning and adds
the data path to git's safe.directory list (uid mismatch is common when
prior pods ran as root and the new pod runs as uid 1000).
- **imagePullPolicy: Always for the sandbox image.** The `:sandbox` tag is
mutable: it is re-pushed on each iteration. IfNotPresent would let kubelet
reuse cached layers from an older digest. Production overlays should pin to
a digest and set this back to IfNotPresent.
- **`docs/operations/sandbox-deploy.md`** + image rename to
codeforphilly-ng. Doc now notes `--platform=linux/amd64` is required on
Apple Silicon, because Linode LKE nodes are amd64 and won't pull an
arm64-only manifest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster has Envoy Gateway (gatewayClassName: eg) at 139.144.241.4; DNS for
*.sandbox.k8s.phl.io and next-v2.codeforphilly.org CNAMEs into it. The nginx
Ingress lived at a separate IP (45.79.246.168) that nothing currently points
at, so it was dead weight.
- Add Gateway + HTTPRoute to kustomize/base; cert-manager annotation on the
Gateway provisions the codeforphilly-gw-tls Secret via letsencrypt-prod.
- Drop kustomize/base/ingress.yaml.
- Sandbox overlay patches both resources to next-v2.codeforphilly.org.
- docs/operations/sandbox-deploy.md updated to reflect the new ingress path
and the production hostname.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds GITHUB_OAUTH_CLIENT_ID + GITHUB_OAUTH_CLIENT_SECRET alongside the
existing CFP_JWT_SIGNING_KEY + CFP_DATA_REMOTE. The sandbox OAuth App
"Code for Philly (sandbox)" is registered under the CodeForPhilly org with
callback https://next-v2.codeforphilly.org/api/auth/github/callback.
Verified end-to-end: GET /api/auth/github/start 302s to
github.com/login/oauth/authorize with the correct client_id, redirect_uri,
scopes (read:user user:email), and PKCE code_challenge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
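For reviewers unfamiliar with the redirect being verified above, here is a minimal sketch of how an authorize URL with a PKCE `code_challenge` can be built. The helper names (`createPkcePair`, `buildAuthorizeUrl`), the `state` parameter, and the choice of the S256 method are assumptions for illustration; only the authorize endpoint and the scopes come from the commit message, and the project's actual route handler may differ.

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hypothetical shape; not the project's real type.
interface PkcePair {
  verifier: string;
  challenge: string;
}

// Generates a PKCE verifier and its S256 challenge (RFC 7636).
function createPkcePair(): PkcePair {
  // 32 random bytes encode to a 43-character base64url string.
  const verifier = randomBytes(32).toString("base64url");
  // S256: challenge = BASE64URL(SHA256(verifier))
  const challenge = createHash("sha256").update(verifier).digest("base64url");
  return { verifier, challenge };
}

// Builds the redirect target for the start route (assumed parameter set).
function buildAuthorizeUrl(
  clientId: string,
  redirectUri: string,
  challenge: string,
  state: string,
): string {
  const params = new URLSearchParams({
    client_id: clientId,
    redirect_uri: redirectUri,
    scope: "read:user user:email",
    state,
    code_challenge: challenge,
    code_challenge_method: "S256",
  });
  return `https://github.com/login/oauth/authorize?${params.toString()}`;
}
```

The verifier stays server-side (for example in a short-lived cookie or session) and is sent only in the later token exchange; the challenge is the only PKCE value that appears in the redirect.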
The session middleware (and a couple of sibling lookups in the auth and
SAML routes) was reaching for the Person via `store.public.people.queryFirst({ id })`,
which doesn't reflect commits made in the same process between transact and
the next sheet refresh. Newly-created users (the create-fresh OAuth path)
got a valid session JWT but the middleware's lookup returned null, so the
SPA's /api/auth/me saw `accountLevel: user` with `person: null` and rendered
as not-signed-in.
Switched all three call sites to `fastify.inMemoryState.people.get(id)` —
the canonical fast index already used by every other id→Person lookup in
the codebase (projects-members, projects-buzz). Same source of truth that
`stateApply.apply()` writes to immediately after the gitsheets transact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
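The failure mode described above can be reproduced in a few lines. This is a simplified model, not the project's code: `SheetSnapshot`, `InMemoryPeople`, and `createPerson` are hypothetical stand-ins for the gitsheets sheet view, `fastify.inMemoryState.people`, and the transact-then-apply path respectively.

```typescript
interface Person {
  id: string;
  name: string;
}

// Models a sheet view that only sees commits after an explicit refresh.
class SheetSnapshot {
  private rows = new Map<string, Person>();

  refresh(committed: ReadonlyMap<string, Person>): void {
    this.rows = new Map(committed);
  }

  queryFirst(id: string): Person | null {
    return this.rows.get(id) ?? null;
  }
}

// Models the in-memory index that is written right after each commit.
class InMemoryPeople {
  private byId = new Map<string, Person>();

  apply(person: Person): void {
    this.byId.set(person.id, person);
  }

  get(id: string): Person | null {
    return this.byId.get(id) ?? null;
  }
}

const committed = new Map<string, Person>();
const snapshot = new SheetSnapshot();
const people = new InMemoryPeople();

function createPerson(person: Person): void {
  committed.set(person.id, person); // stands in for the transact commit
  people.apply(person); // index is updated immediately afterwards
}

createPerson({ id: "p1", name: "New User" });

// The snapshot has not been refreshed yet, so the old lookup misses.
const staleLookup = snapshot.queryFirst("p1");
// The index already has the new Person.
const freshLookup = people.get("p1");
```

In this model the session middleware corresponds to the `staleLookup` path before the fix and the `freshLookup` path after it.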
The deploy plan shipped the deploy-key mount + GIT_SSH_COMMAND wiring but
never actually called `repo.startPushDaemon()`, so every commit produced by
store.transact landed on the local branch and was wiped the next time the
entrypoint refreshed against origin. Without this, accounts created via
OAuth survived only until the next pod restart.
- `openPublicStore` now returns `{ store, repo }` so the booted Repository
handle is reachable from the plugin layer
- `bootStores` plumbs the repo through as `publicRepo`
- New `apps/api/src/plugins/push-daemon.ts` starts the daemon when
`CFP_DATA_REMOTE` is set; emits push/retry/error logs; discriminates
non-fast-forward (terminal) from transient (retried) failures per
gitsheets 1.0.5+
- Fastify `onClose` hook stops the daemon on SIGTERM
- `fastify.pushDaemon` decoration (typed `PushDaemon | null`) lets future
admin/status routes surface daemon state
- `CFP_DATA_BRANCH` lifted into the API env schema so the daemon pushes
to the right branch (it was previously only consumed by the entrypoint)
- Scripts/tests using `openPublicStore` updated to destructure the new
return shape and reference the exported `PublicStore` type directly
Verified: boot logs show `push-daemon started remote=origin branch=fixture`
on the running sandbox pod. The follow-up problem of the entrypoint's
unconditional `git reset --hard origin/<branch>` discarding local commits
between pod-terminate and the daemon's next push tick is separate and
will be addressed in a follow-up commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
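The terminal-versus-transient split mentioned above can be sketched as a small classifier. This is an illustrative assumption, not the gitsheets implementation: `classifyPushFailure`, `decideRetry`, and the backoff numbers are hypothetical, and the classifier keys off the rejection reasons that `git push` prints in its status lines (`(non-fast-forward)` and `(fetch first)`).

```typescript
type PushFailureKind = "non-fast-forward" | "transient";

// Classifies a failed `git push` from its stderr text.
function classifyPushFailure(stderr: string): PushFailureKind {
  // Rejected-ref status lines carry the reason in parentheses.
  const rejected = /\(non-fast-forward\)|\(fetch first\)/;
  return rejected.test(stderr) ? "non-fast-forward" : "transient";
}

interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

// Retrying a non-fast-forward push cannot succeed without a fetch and
// rebase, so it is treated as terminal. Everything else backs off
// exponentially, capped at capMs (the values are assumptions).
function decideRetry(
  stderr: string,
  attempt: number,
  baseMs = 1_000,
  capMs = 60_000,
): RetryDecision {
  if (classifyPushFailure(stderr) === "non-fast-forward") {
    return { retry: false, delayMs: 0 };
  }
  return { retry: true, delayMs: Math.min(capMs, baseMs * 2 ** attempt) };
}
```

Treating unknown errors as transient keeps the daemon retrying through network blips, at the cost of retrying some failures that will never succeed; the cap bounds that cost.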
The previous entrypoint did `git fetch && git reset --hard origin/<branch>`
on every boot, which discarded any commits the API made locally that the
push daemon hadn't yet pushed (the structural cause of the post-OAuth
account-loss bug). With the push daemon now wired (5c98144 / #37), the
entrypoint can be smarter:
Boot-time states + actions, in order of cheapness:
- in sync with origin → no-op
- behind origin → `git merge --ff-only`
- ahead of origin → push (push-daemon retries on failure)
- diverged + clean rebase → rebase onto origin, then push
- diverged + conflicting → escape hatch:
  1. abort rebase
  2. create `conflicts/<UTC-timestamp>` branch at the pre-rebase HEAD
  3. push that branch to origin (loudly logged for operators)
  4. hard-reset local to origin so the pod boots from known-good state
Safety properties:
- Work is never silently dropped; a conflict produces a named branch.
- Fetch failures (network blips) are non-fatal; entrypoint falls through to
API start with whatever's locally available.
- Push failures during reconciliation are non-fatal; the running push
daemon retries.
- Drops `--depth=1` from initial clone, because rebase needs the
merge-base. Existing shallow PVCs are unshallowed on first boot of this
image.
Pseudonymous identity (api@users.noreply.codeforphilly.org) is used for the
entrypoint's own git operations; commit authors are preserved on rebase.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
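The boot-time decision table above can be written as a pure function, which makes the five states easy to unit-test separately from the git commands. This is a sketch under assumptions: the real entrypoint's language and structure are not shown in this PR text, and `planBootAction`, `conflictBranchName`, and the compact timestamp format are hypothetical. The ahead/behind counts are assumed to come from something like `git rev-list --left-right --count origin/<branch>...HEAD`.

```typescript
type BootAction =
  | "no-op"
  | "fast-forward"
  | "push"
  | "rebase-then-push"
  | "park-conflict-branch";

interface RepoState {
  ahead: number; // commits on local HEAD that origin/<branch> lacks
  behind: number; // commits on origin/<branch> that local HEAD lacks
  rebaseIsClean: boolean; // only consulted when the branches have diverged
}

// Maps the observed repository state to one of the five boot-time actions.
function planBootAction(state: RepoState): BootAction {
  if (state.ahead === 0 && state.behind === 0) return "no-op";
  if (state.ahead === 0) return "fast-forward";
  if (state.behind === 0) return "push";
  return state.rebaseIsClean ? "rebase-then-push" : "park-conflict-branch";
}

// Builds a conflicts/<UTC-timestamp> branch name. The exact format is an
// assumption: 2025-01-02T03:04:05.000Z becomes conflicts/20250102T030405Z.
function conflictBranchName(now: Date): string {
  const stamp = now
    .toISOString()
    .replace(/[-:]/g, "")
    .replace(/\.\d{3}Z$/, "Z");
  return `conflicts/${stamp}`;
}
```

A timestamp without colons keeps the branch name valid as a git ref, since git does not allow `:` in ref names.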
`packages/shared` exports compiled output from `dist/` (since the Docker
build fix in f805a1a). Tests, type-check, and lint resolve `@cfp/shared` via
the exports map, which means `dist/` must exist before any of them run, but
CI was running type-check before build. Add an explicit
`npm run -w packages/shared build` step right after `npm ci`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stands up the first working end-to-end deploy of the rewrite onto the CodeForPhilly sandbox cluster (Linode LKE, sandbox.k8s.phl.io). Live at https://next-v2.codeforphilly.org.
This is the operational follow-up to the deploy plan (#35): the work needed to take that plan's containerization + chart skeleton and actually get a pod running, holding state across restarts, and serving authenticated users behind TLS. Closes #36 and #37.
What's new
- Kustomize manifests (`deploy/kustomize/`) replacing the Helm chart-based plan. Base + sandbox overlay. Per the spec PR #60 (which leads this one). The Helm chart under `deploy/charts/` is left behind for now and will be retired in a follow-up PR alongside the GHA workflow flip.
- Gateway API instead of nginx Ingress. Cluster has Envoy Gateway (`gatewayClassName: eg`); the sandbox subdomain (`*.sandbox.k8s.phl.io`) CNAMEs to it. Cert-manager provisions TLS via `letsencrypt-prod`, driven by an annotation on the Gateway. The previously-shipped `Ingress` is removed.
- GitHub OAuth integration. Sandbox OAuth App registered under the CodeForPhilly org; credentials sealed into the namespace's env Secret alongside JWT signing key + data-remote URL. The signed-in user is now resolved from the in-memory state index (`fastify.inMemoryState.people.get(id)`) rather than a slow sheet-level scan that didn't reflect in-process writes.
- Push daemon wired (closes #37). `repo.startPushDaemon()` is now called in API boot when `CFP_DATA_REMOTE` is set; emits push/retry/error logs and discriminates non-fast-forward (terminal) from transient failures. Fastify `onClose` stops the daemon on SIGTERM. The change rippled into a small but real return-type change for `openPublicStore` so the underlying Repository handle is reachable from the plugin layer.
- Smart entrypoint reconciliation. Replaces `git reset --hard origin/<branch>` with state-aware logic; on an unresolvable conflict it pushes a `conflicts/<UTC-timestamp>` branch to origin (loud for operators) and hard-resets local to known-good. Work is never silently dropped. Combined with the push daemon, a pod restart now reliably preserves data.
- Iteration polish. GHCR package renamed to match the repo (`codeforphilly-ng`); `imagePullPolicy: Always` on the mutable `:sandbox` tag; `--platform=linux/amd64` note in the runbook for Apple Silicon builds; entrypoint hardened against PVC residue and dubious-ownership uid mismatches.
Verification
Stand-up validated end-to-end on the live cluster:
- Push to `origin/fixture` ~1s after sign-in (verified in pod logs)
- `tsc --noEmit` clean on apps/api
What's deliberately deferred
- `deploy/charts/codeforphilly/`: kept for now; retired in a follow-up alongside `.github/workflows/deploy-{staging,production}.yml` flipping from `helm` to `kubectl apply -k`.
Test plan
🤖 Generated with Claude Code