refactor(importer): use gitsheets transact + slug-keyed paths (closes #59)#64
Merged
Merged
Conversation
…eyed paths The agent's first cut of the JSON importer (PR #57) bypassed gitsheets and wrote raw TOML files under legacyId-keyed paths (people/${legacyId}.toml). The stated reason was "clean diffs across re-runs when slugs rename" — a property that nothing in the runtime actually consumes, in exchange for: - a custom toToml() that has to match the lib's serializer - a hand-rolled wipeOwnedDirectories() + writeAllRecords() filesystem layer that has to match gitsheets' tree expectations - a `git cat-file --batch` UUID-forward-read pass to preserve UUIDs - a separate reconciliation step (file rename pass) before legacy-import could merge into fixture/main (issue #59) This rewrite uses the library as designed: - `openPublicStore` opens the gitsheets repo handle for the data repo - A single `store.transact` per snapshot: `await tx[sheet].clear()` empties each importer-owned sheet, then a loop of `await tx[sheet].upsert(record)` re-populates fresh. Deletes are captured implicitly — anything that was in the previous snapshot but isn't in the new one is gone after clear() - Path templates from `.gitsheets/<name>.toml` produce slug-keyed paths directly (`people/${{ slug }}.toml`, `tags/${{ namespace }}/${{ slug }}.toml`, `project-memberships/${{ projectSlug }}/${{ personSlug }}.toml`, …), matching the runtime spec. legacy-import → fixture is now a plain `git merge`, no rename pass. - Zod schema validation runs on every record (the bare-git version skipped it once translation passed) - Pre-pass walks the previous snapshot via `store.<sheet>.query()` to preserve record UUIDs across re-runs. Composite-path sheets (project-memberships, tag-assignments) recover legacy IDs via uuid→legacyId reverse maps built from the simple-sheet pass. Composite-path records (memberships, updates, buzz) need slug fields in the upserted record for the path template to render; passthrough on the schemas lets us add `projectSlug` / `personSlug` alongside the schema-declared `projectId` / `personId`. Same pattern the API's runtime write services already use (`withMembershipPath`). Deletes: - `toToml()` (replaced by gitsheets' own TOML serializer) - `writeRecord()` / `writeAllRecords()` / `wipeOwnedDirectories()` (gitsheets Sheet operations replace them) - `git cat-file --batch` (gitsheets `query()` replaces it) - `ensureBranch` + `checkoutBranch` + `createImportCommit` (collapses to a small `ensureBranchCheckedOut` helper; transact does the commit) - The `--no-commit` CLI flag (gitsheets transact is atomic; `--dry-run` is the only meaningful preview) Known upstream limitations (filed): - JarvusInnovations/gitsheets#178 — `Sheet#clear()` walks + deletes every child instead of pointing the subtree at git's empty-tree hash - JarvusInnovations/gitsheets#179 — `Transaction#finalize()` commits any time `markMutated` is set, even when the resulting tree-hash equals the parent's. We work around #179 here with a post-transact tree-hash compare + `git update-ref` reset; clean fix is upstream. Tests: - All 22 importer tests pass against an updated `makeRepo()` fixture that seeds the full `.gitsheets/*.toml` matrix the gitsheets store requires - The "modified single record" idempotence test now asserts a slug-keyed diff (`projects/transit-app.toml`) instead of the legacyId-keyed `projects/100.toml` - File-content reads switched from working-tree `readFile` to `git show HEAD:<path>` — gitsheets commits to the object DB without touching the working tree, so the previous test pattern was reading the pre-commit state by accident Live verification: import against codeforphilly.org (page-size=10000, delay-ms=0) produces 44,077 records in a slug-keyed tree on legacy-import, ready to merge into fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous resource limits (768Mi memory, default Node heap ~400Mi) were fine for the 4-record fixture but OOM'd immediately on the real import: <--- Last few GCs ---> Mark-Compact (reduce) 383.2 (390.9) -> 382.7 (391.4) MB FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory Two changes, both rooted in the in-memory model the API uses (full public dataset + secondary indices + FTS all sit in RAM by design): - Bump container memory: requests 384→768 Mi, limits 768→2048 Mi. Headroom for ~31k people, 268 projects, 10k tag-assignments, FTS index, plus ephemeral per-request work. - Set NODE_OPTIONS=--max-old-space-size=1536 in the env ConfigMap. Node's default old-space cap in containers is ~25% of memory (kernel-detected, capped low). Without raising it, even a generous container limit doesn't matter — the pod OOMs at the heap ceiling before it touches the cgroup limit. 1.5 Gi leaves ~25% headroom under the 2 Gi container ceiling for the runtime's own ephemeral allocations. Live-verified on sandbox: pod boots in ~20s, 269 projects + ~30k people loaded, FTS ready, facet aggregations correct, readiness probe green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sed imports gitsheets v1.3.1 ships fixes for the two upstream issues this PR's importer worked around: - JarvusInnovations/gitsheets#178 — `Sheet#clear()` now uses git's empty-tree hash via hologit's `TreeObject.clearChildren()`. Constant-time instead of walking + `deleteChild` per entry. Snapshot importers feel this most. - JarvusInnovations/gitsheets#179 — `Transaction#finalize` now compares resulting tree-hash to parent's tree-hash; no commit is created when they match. The post-transact `git update-ref` workaround in the importer can go away. Removed: - The post-transact tree-hash comparison + ref reset (no longer needed) - Unused `IdMaps` type import (left over from the rewrite) - Unused `readFile` import in the test (replaced by `git show HEAD:<path>`) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
PR #57's importer bypassed gitsheets and wrote raw TOML files under legacyId-keyed paths. The stated reason was "clean diffs across re-runs when slugs rename" — but nothing in the runtime consumes that property, and it cost us:
toToml()that has to mirror the lib's serializerwipeOwnedDirectories()+writeAllRecords()filesystem layergit cat-file --batchUUID-forward-read pre-passlegacy-importcould merge intofixture/main(#59)How
Use the library as designed. One
store.transactper snapshot:clear()empties each sheet; everything missing from the new set is gone — deletes captured for freeupsert()writes via the sheet's path template, producing slug-keyed paths that match the runtime specstore.<sheet>.query()to preserve record UUIDs across re-runs. Composite-path sheets (memberships, tag-assignments) recover legacy IDs via uuid→legacyId reverse maps built from the simple-sheet pass.Closes #59
legacy-import → fixtureis now a plaingit merge, no rename step. The agent's issue #59 about the runtime-reconciliation hump is obviated.Upstream issues filed
Found two gitsheets-side perf/semantics gaps along the way:
Sheet#clear()walks +deleteChildper entry instead of pointing the subtree at git's empty-tree hashTransaction#finalize()commits whenevermarkMutatedis set, even when the resulting tree-hash equals the parent'sThe importer has a small post-
transactworkaround for #179 (compare tree-hash to parent, reset the ref if identical). The workaround comes out once gitsheets fixes #179 upstream.Diff
apps/api/scripts/import-laddr/importer.ts(~800 lines deleted, ~400 added) — see the commit body for the full inventory of deletions--no-commitflag (gitsheets transact is atomic;--dry-runis the meaningful preview).gitsheets/*.tomlmatrix instead of justpeople.toml); the "modified single record" idempotence test now asserts a slug-keyed pathLive verification
Run against codeforphilly.org (
--page-size=10000 --delay-ms=0):Tree shape on
legacy-import(samples):Test plan
npm run -w apps/api type-check— cleannpm run -w apps/api test -- tests/import-laddr.test.ts— 22/22npm run -w apps/api test— running at PR creation; CI will confirmlegacy-import🤖 Generated with Claude Code