claude-backfill: recover repo identity from cwd for pre-0032 sessions #131
Sessions recorded before LLP 0032's git capture landed have a session-context sidecar (and transcript) with no remote/HEAD/root, so their Session -in-> Repo edge never mints and their enrichment floats unattributed. The one repo signal they carry is the cwd on each transcript line, so the backfill now runs git in that cwd at import time (git_repo.js deriveRepoFromCwd) when the record supplies no remote.

Derives git_remote + repo_root only, NEVER head_sha: rev-parse HEAD today reports the repo's current HEAD, not the commit the session sat on, so a recovered sha would mint a wrong Commit node. The headline session<->repo join needs only the remote, and a toplevel is stable across commits.

- git_repo.js: deriveRepoFromCwd, redacts credential userinfo, injectable git seam for hermetic tests.
- backfill.js: resolve cwd as record.cwd ?? transcript cwd; derive when the record lacks a remote; per-cwd memoized so one git probe runs per distinct directory, not per session.
- transcripts.js/types.d.ts: parse cwd off each transcript line.
- LLP 0032 §capture: document backfill recovery + the part_id-dedup caveat (refreshing pre-capture history needs the old rows dropped first).

Validated against real repos (hyparam/hypaware incl. a live worktree, hypaware-server) and a full rehearsal backfill of ~/.claude history: 694/774 on-disk sessions recovered a correct git_remote (was 0/1147).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK
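For readers without the diff open, the derivation this commit describes might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in git_repo.js: the helper names `runGit` and `redactRemote`, the `{ git_remote, repo_root }` return shape, and the choice to redact only http(s) userinfo are all hypothetical. The two git commands and the 2000ms timeout are the ones named in this PR.

```javascript
const { execFile } = require('node:child_process')

// Default git seam (hypothetical name): run `git <args>` in `cwd` and resolve
// to trimmed stdout, or undefined when git fails, times out, or prints nothing.
function runGit(cwd, args) {
  return new Promise((resolve) => {
    execFile('git', args, { cwd, timeout: 2000 }, (err, stdout) => {
      resolve(err ? undefined : (String(stdout).trim() || undefined))
    })
  })
}

// Assumption: redaction strips `user:token@` from http(s) remotes only.
// scp-style remotes such as git@host:org/repo.git pass through unchanged.
function redactRemote(remote) {
  return remote.replace(/^(https?:\/\/)[^/@]+@/i, '$1')
}

// Recover git_remote and repo_root from a directory. head_sha is deliberately
// never derived: HEAD today is not the commit the historical session sat on.
async function deriveRepoFromCwd(cwd, git = runGit) {
  if (!cwd) return {}
  const [remote, root] = await Promise.all([
    git(cwd, ['config', 'remote.origin.url']),
    git(cwd, ['rev-parse', '--show-toplevel']),
  ])
  const derived = {}
  if (remote) derived.git_remote = redactRemote(remote)
  if (root) derived.repo_root = root
  return derived
}
```

Passing the git runner as a parameter is what keeps tests hermetic: a test can hand in a stub that returns canned strings, so no real repository or git binary is needed.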
Dual-agent review —
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Claude | Canonical record-present/no-remote recovery untested (major, backfill.js:336,351) | Targets (projectedExchangeFromEntries), Risks bullet 1 |
| Claude | record repo_root not guarded against derived overwrite (minor, backfill.js:354) | Risks bullet 3 (cwd-reuse / derived repo_root) |
- Codex review: (no findings reported)
Claude review
Five independent lenses ran (guidance compliance, shallow bug scan, historical
context, contract & callers, comments & tests). Four found nothing material:
code style/JSDoc/@ref rules all satisfied and every anchor resolves; the
sync→async change to projectedExchangeFromEntries has its one (module-private)
caller awaiting it; the new required cwd field on TranscriptEntry is set by
the only constructor (transcriptEntryFromRow) and tsc --noEmit is clean; the
per-cwd Promise memo is correctly wired (one git probe per distinct cwd, caches
the in-flight Promise); git_repo.js faithfully mirrors the live hook's
gitLine/redaction with a deliberate 2000ms timeout and no head_sha; and the
LLP 0032 part_id-dedupe caveat is accurate. Two test-coverage gaps survived the
≥80 filter.
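The per-cwd Promise memo mentioned above can be pictured with a small sketch. This is a hypothetical helper (`memoizePerCwd` is not a name from the PR) showing the general technique of caching the in-flight Promise so that concurrent lookups for the same directory share one probe.

```javascript
// Wrap an async derive function so each distinct cwd triggers at most one call.
// The cache stores the Promise itself, so callers that arrive while the first
// probe is still running get the same in-flight Promise instead of a new probe.
function memoizePerCwd(derive) {
  const cache = new Map() // cwd -> in-flight or settled Promise
  return (cwd) => {
    let pending = cache.get(cwd)
    if (!pending) {
      pending = derive(cwd)
      cache.set(cwd, pending)
    }
    return pending
  }
}
```

Caching the Promise rather than the resolved value matters for a backfill that processes many sessions concurrently: a value cache would still let several probes start before the first one finishes.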
Canonical pre-0032 recovery (record present, cwd, no remote) is untested
- Severity: major
- Confidence: 88
- Evidence: test/plugins/claude-backfill.test.js:622 (only recovery test uses NO record at all); hypaware-core/plugins-workspace/claude/src/backfill.js:336 (`const cwd = record?.cwd ?? transcriptCwd`) and :351 (derivation guarded on that cwd)
- Why it matters: LLP 0032 defines the primary recovery target as a session whose sidecar record EXISTS but carries cwd and no remote/HEAD/root; the new tests only cover "no record at all → derive from transcriptCwd", so the headline branch (derive `git_remote` from `record.cwd` while the record's other fields are present) has zero coverage, and a future change to the `record?.cwd ?? transcriptCwd` precedence or the derivation guard would pass all tests while silently breaking pre-0032 recovery.
- Suggested fix: Add a backfill test that writes a session-context record with `cwd`/`git_branch` but no `git_remote`/`head_sha`/`repo_root`, asserting (i) `deriveRepo` is called with `record.cwd` (not the transcript cwd) and (ii) the recovered `git_remote` lands on the exchange.
No test guards "record repo_root is not overwritten by derived repo_root"
- Severity: minor
- Confidence: 82
- Evidence: hypaware-core/plugins-workspace/claude/src/backfill.js:354 (`if (derived.repo_root && !exchange.repo_root) exchange.repo_root = derived.repo_root`)
- Why it matters: The `&& !exchange.repo_root` guard preserves the record's authoritative toplevel when only the remote is recovered; dropping it would let a derived (possibly worktree-shifted) `repo_root` clobber the record's value, and no test would fail, because the "record wins" test short-circuits derivation entirely when the record already has a remote.
- Suggested fix: Add/extend a test where the record supplies `repo_root` but no `git_remote` and `deriveRepo` returns a remote plus a DIFFERENT `repo_root`; assert the derived remote lands while `exchange.repo_root` keeps the record's value.
Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr131/dual-review/pr-131
Address two test-coverage gaps from the dual-review of #131. Both are test-only; the production logic is correct as written.

1. (major) The canonical pre-0032 shape (a session-context record that EXISTS with a cwd but no git_remote/head_sha/repo_root) had no test; the existing recovery test uses no record at all, so it only exercised the transcriptCwd fallback, not the `record?.cwd ?? transcriptCwd` precedence (backfill.js:336) the headline scenario needs. The new test makes the record's cwd differ from the transcript line's cwd and asserts deriveRepo is called with the record's cwd and the recovered git_remote lands.
2. (minor) The `&& !exchange.repo_root` guard (backfill.js:354), which ensures a record-supplied repo_root is not clobbered by a derived one, had no test. The new test supplies repo_root but no git_remote, returns a different repo_root from deriveRepo, and asserts the derived remote lands while the record's repo_root is preserved.

Verified by mutation: flipping the cwd precedence fails (1); dropping the repo_root guard fails (2). Full suite green (1302 pass), tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Claude sessions recorded before the LLP 0032 git capture landed (#125) carry no `git_remote` (every such row is `schema_version` 6), so their `Session -in-> Repo` edge never mints and any enrichment produced from them floats unattributed. In the live graph this was the whole Claude corpus: 0 of 1,147 Claude sessions had a remote, leaving only 264 / 1,368 (19%) enrichment items repo-linked (all from Codex).

The 0032 capture code exists and is correct; it just postdates these sessions, and their session-context sidecars (and transcripts) never carried the git fields.
Fix
The one repo signal a pre-0032 session does carry is the `cwd` stamped on every Claude transcript line. The backfill provider now runs git in that cwd at import time to recover identity when the session-context record supplies no remote.

- `claude/src/git_repo.js` (new): `deriveRepoFromCwd(cwd)` runs `git config remote.origin.url` + `rev-parse --show-toplevel`, with credential userinfo redacted at ingress (LLP 0032 §remote-redaction). Injectable git seam for hermetic tests.
- `claude/src/backfill.js`: resolve cwd as `record.cwd ?? transcript cwd`; derive when the record lacks a remote; per-cwd memoized so a backfill over thousands of sessions runs one git probe per distinct directory, not per session.
- `claude/src/transcripts.js` / `types.d.ts`: parse `cwd` off each transcript line.

Derives `git_remote` + `repo_root` only, never `head_sha`

`rev-parse HEAD` now reports the repo's current HEAD, not the commit the historical session sat on, so a recovered sha would mint a wrong `Commit` node. The headline session↔repo join needs only the remote, and a repo's toplevel is stable across commits: both are safe to derive after the fact; the sha is not. A cwd that no longer resolves to a git repo (deleted worktree, moved checkout) recovers nothing and falls back to absolute `File` keys.

Going-forward behaviour

This is a permanent code path, not a one-off data patch: future `hyp backfill claude` runs attribute newly-imported sessions from their cwd. (Live capture going forward still depends on a 0032-aware daemon actually running the hook, which is separate and already merged in #125.)

One operational note, documented in the LLP: refreshing already-imported history requires dropping the old `ai_gateway_messages` rows first, because the materializer's pre-write `part_id` dedupe drops byte-identical re-imports.

Validation

- `npm test` green; `tsc --noEmit` clean; `ref-check` clean.
- `claude-git-repo.test.js` (5) + 3 in `claude-backfill.test.js`: record wins over derivation, cwd-fallback fills remote/root, head_sha never derived, per-cwd memoization, credential redaction, graceful degrade.
- `~/.claude` history: 694 / 774 on-disk sessions recovered a correct `git_remote` (was 0), resolving to hyparam/hypaware (incl. a live worktree), stele, hypaware-server, icebird, … After applying to the live cache + re-projecting, repo-linked enrichment items went 264 → 1,153 (19% → 84%) with the total item count unchanged (no re-enrichment).

🤖 Generated with Claude Code