Skip to content

refactor(importer): use gitsheets transact + slug-keyed paths (closes #59)#64

Merged
themightychris merged 3 commits into
mainfrom
chore/import-via-gitsheets-transact
May 19, 2026
Merged

refactor(importer): use gitsheets transact + slug-keyed paths (closes #59)#64
themightychris merged 3 commits into
mainfrom
chore/import-via-gitsheets-transact

Conversation

@themightychris
Copy link
Copy Markdown
Member

Why

PR #57's importer bypassed gitsheets and wrote raw TOML files under legacyId-keyed paths. The stated reason was "clean diffs across re-runs when slugs rename" — but nothing in the runtime consumes that property, and it cost us:

  • a custom toToml() that has to mirror the lib's serializer
  • a hand-rolled wipeOwnedDirectories() + writeAllRecords() filesystem layer
  • a git cat-file --batch UUID-forward-read pre-pass
  • a separate file-rename reconciliation step before legacy-import could merge into fixture/main (#59)

How

Use the library as designed. One store.transact per snapshot:

await store.transact({ message, author, trailers }, async (tx) => {
  for (const sheetName of importerOwnedSheets) {
    await tx[sheetName].clear();
    for (const record of translated[sheetName]) {
      await tx[sheetName].upsert(record);
    }
  }
});
  • clear() empties each sheet; everything missing from the new set is gone — deletes captured for free
  • upsert() writes via the sheet's path template, producing slug-keyed paths that match the runtime spec
  • Zod validation runs on every record (the bare-git version skipped it once translation passed)
  • Pre-pass walks the previous snapshot via store.<sheet>.query() to preserve record UUIDs across re-runs. Composite-path sheets (memberships, tag-assignments) recover legacy IDs via uuid→legacyId reverse maps built from the simple-sheet pass.

Closes #59

legacy-import → fixture is now a plain git merge, no rename step. The agent's issue #59 about the runtime-reconciliation hump is obviated.

Upstream issues filed

Found two gitsheets-side perf/semantics gaps along the way:

The importer has a small post-transact workaround for #179 (compare tree-hash to parent, reset the ref if identical). The workaround comes out once gitsheets fixes #179 upstream.

Diff

  • Importer: apps/api/scripts/import-laddr/importer.ts (~800 lines deleted, ~400 added) — see the commit body for the full inventory of deletions
  • CLI: drops the --no-commit flag (gitsheets transact is atomic; --dry-run is the meaningful preview)
  • Tests: 22/22 passing on the rewritten test fixture (full .gitsheets/*.toml matrix instead of just people.toml); the "modified single record" idempotence test now asserts a slug-keyed path

Live verification

Run against codeforphilly.org (--page-size=10000 --delay-ms=0):

tags                   imported=897 skipped=120 errors=0
people                 imported=31440 skipped=6 errors=0
projects               imported=268 skipped=0 errors=0
project-memberships    imported=822 skipped=1 errors=0
project-updates        imported=504 skipped=13 errors=0
project-buzz           imported=32 skipped=81 errors=0
tag-assignments        imported=10114 skipped=184 errors=0

Tree shape on legacy-import (samples):

people/chris.toml
people/a-gemm.toml
projects/transit-tools.toml
tags/topic/transit.toml
tags/tech/javascript.toml
project-memberships/actualizeme/forager.toml
tag-assignments/person/<uuid>/<uuid>.toml

Test plan

  • npm run -w apps/api type-check — clean
  • npm run -w apps/api test -- tests/import-laddr.test.ts — 22/22
  • Full npm run -w apps/api test — running at PR creation; CI will confirm
  • End-to-end live run against codeforphilly.org produces a slug-keyed tree on legacy-import
  • Idempotent re-run on identical mock data → no new commit (post-transact tree-hash reset)

🤖 Generated with Claude Code

themightychris and others added 3 commits May 18, 2026 22:31
…eyed paths

The agent's first cut of the JSON importer (PR #57) bypassed gitsheets and
wrote raw TOML files under legacyId-keyed paths (people/${legacyId}.toml).
The stated reason was "clean diffs across re-runs when slugs rename" — a
property that nothing in the runtime actually consumes, in exchange for:

- a custom toToml() that has to match the lib's serializer
- a hand-rolled wipeOwnedDirectories() + writeAllRecords() filesystem
  layer that has to match gitsheets' tree expectations
- a `git cat-file --batch` UUID-forward-read pass to preserve UUIDs
- a separate reconciliation step (file rename pass) before legacy-import
  could merge into fixture/main (issue #59)

This rewrite uses the library as designed:

- `openPublicStore` opens the gitsheets repo handle for the data repo
- A single `store.transact` per snapshot: `await tx[sheet].clear()` empties
  each importer-owned sheet, then a loop of `await tx[sheet].upsert(record)`
  re-populates fresh. Deletes are captured implicitly — anything that was
  in the previous snapshot but isn't in the new one is gone after clear()
- Path templates from `.gitsheets/<name>.toml` produce slug-keyed paths
  directly (`people/${{ slug }}.toml`, `tags/${{ namespace }}/${{ slug }}.toml`,
  `project-memberships/${{ projectSlug }}/${{ personSlug }}.toml`, …),
  matching the runtime spec. legacy-import → fixture is now a plain
  `git merge`, no rename pass.
- Zod schema validation runs on every record (the bare-git version skipped
  it once translation passed)
- Pre-pass walks the previous snapshot via `store.<sheet>.query()` to
  preserve record UUIDs across re-runs. Composite-path sheets
  (project-memberships, tag-assignments) recover legacy IDs via
  uuid→legacyId reverse maps built from the simple-sheet pass.

Composite-path records (memberships, updates, buzz) need slug fields in
the upserted record for the path template to render; passthrough on the
schemas lets us add `projectSlug` / `personSlug` alongside the
schema-declared `projectId` / `personId`. Same pattern the API's runtime
write services already use (`withMembershipPath`).

Deletes:
- `toToml()` (replaced by gitsheets' own TOML serializer)
- `writeRecord()` / `writeAllRecords()` / `wipeOwnedDirectories()` (gitsheets
  Sheet operations replace them)
- `git cat-file --batch` (gitsheets `query()` replaces it)
- `ensureBranch` + `checkoutBranch` + `createImportCommit` (collapses to a
  small `ensureBranchCheckedOut` helper; transact does the commit)
- The `--no-commit` CLI flag (gitsheets transact is atomic; `--dry-run` is
  the only meaningful preview)

Known upstream limitations (filed):
- JarvusInnovations/gitsheets#178 — `Sheet#clear()` walks + deletes every
  child instead of pointing the subtree at git's empty-tree hash
- JarvusInnovations/gitsheets#179 — `Transaction#finalize()` commits any
  time `markMutated` is set, even when the resulting tree-hash equals the
  parent's. We work around #179 here with a post-transact tree-hash compare
  + `git update-ref` reset; clean fix is upstream.

Tests:
- All 22 importer tests pass against an updated `makeRepo()` fixture that
  seeds the full `.gitsheets/*.toml` matrix the gitsheets store requires
- The "modified single record" idempotence test now asserts a slug-keyed
  diff (`projects/transit-app.toml`) instead of the legacyId-keyed
  `projects/100.toml`
- File-content reads switched from working-tree `readFile` to
  `git show HEAD:<path>` — gitsheets commits to the object DB without
  touching the working tree, so the previous test pattern was reading the
  pre-commit state by accident

Live verification: import against codeforphilly.org (page-size=10000,
delay-ms=0) produces 44,077 records in a slug-keyed tree on legacy-import,
ready to merge into fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous resource limits (768Mi memory, default Node heap ~400Mi) were
fine for the 4-record fixture but OOM'd immediately on the real import:

  <--- Last few GCs --->
  Mark-Compact (reduce) 383.2 (390.9) -> 382.7 (391.4) MB
  FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Two changes, both rooted in the in-memory model the API uses (full public
dataset + secondary indices + FTS all sit in RAM by design):

- Bump container memory: requests 384→768 Mi, limits 768→2048 Mi. Headroom
  for ~31k people, 268 projects, 10k tag-assignments, FTS index, plus
  ephemeral per-request work.
- Set NODE_OPTIONS=--max-old-space-size=1536 in the env ConfigMap. Node's
  default old-space cap in containers is ~25% of memory (kernel-detected,
  capped low). Without raising it, even a generous container limit doesn't
  matter — the pod OOMs at the heap ceiling before it touches the cgroup
  limit. 1.5 Gi leaves ~25% headroom under the 2 Gi container ceiling for
  the runtime's own ephemeral allocations.

Live-verified on sandbox: pod boots in ~20s, 269 projects + ~30k people
loaded, FTS ready, facet aggregations correct, readiness probe green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sed imports

gitsheets v1.3.1 ships fixes for the two upstream issues this PR's importer
worked around:

- JarvusInnovations/gitsheets#178 — `Sheet#clear()` now uses git's empty-tree
  hash via hologit's `TreeObject.clearChildren()`. Constant-time instead of
  walking + `deleteChild` per entry. Snapshot importers feel this most.
- JarvusInnovations/gitsheets#179 — `Transaction#finalize` now compares
  resulting tree-hash to parent's tree-hash; no commit is created when they
  match. The post-transact `git update-ref` workaround in the importer can
  go away.

Removed:
- The post-transact tree-hash comparison + ref reset (no longer needed)
- Unused `IdMaps` type import (left over from the rewrite)
- Unused `readFile` import in the test (replaced by `git show HEAD:<path>`)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@themightychris themightychris merged commit c56224a into main May 19, 2026
1 check passed
@themightychris themightychris deleted the chore/import-via-gitsheets-transact branch May 19, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ops: push legacy-import to codeforphilly-data and merge into main

1 participant