
claude-backfill: recover repo identity from cwd for pre-0032 sessions#131

Merged
philcunliffe merged 2 commits into master from claude-backfill-cwd-repo-recovery on Jun 22, 2026

@philcunliffe

Contributor

Problem

Claude sessions recorded before the LLP 0032 git capture landed (#125) carry no git_remote — every such row is schema_version 6, so their Session -in-> Repo edge never mints and any enrichment produced from them floats unattributed. In the live graph this was the whole Claude corpus: 0 of 1,147 Claude sessions had a remote, leaving only 264 / 1,368 (19%) enrichment items repo-linked (all from Codex).

The 0032 capture code exists and is correct — it just postdates these sessions, and their session-context sidecars (and transcripts) never carried the git fields.

Fix

The one repo signal a pre-0032 session does carry is the cwd stamped on every Claude transcript line. The backfill provider now runs git in that cwd at import time to recover identity when the session-context record supplies no remote.

  • claude/src/git_repo.js (new) — deriveRepoFromCwd(cwd): git config remote.origin.url + rev-parse --show-toplevel, credential userinfo redacted at ingress (LLP 0032 §remote-redaction). Injectable git seam for hermetic tests.
  • claude/src/backfill.js — resolve cwd as record.cwd ?? transcript cwd; derive when the record lacks a remote; per-cwd memoized so a backfill over thousands of sessions runs one git probe per distinct directory, not per session.
  • claude/src/transcripts.js / types.d.ts — parse cwd off each transcript line.
  • LLP 0032 §capture — new "Backfill recovery for pre-capture sessions" subsection.
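
The derivation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in git_repo.js: the `runGit` helper, the `redactRemote` regex, and the choice to return an empty object when no remote is found are all assumptions made for the example. Only the two git commands, the injectable seam, and the 2000ms timeout mentioned in the review come from this PR.

```javascript
const { execFile } = require('node:child_process');

// Assumed default git seam: run `git <args>` in cwd and resolve the trimmed
// stdout, or null on any failure (missing directory, not a repo, timeout).
function runGit(cwd, args) {
  return new Promise((resolve) => {
    execFile('git', args, { cwd, timeout: 2000 }, (err, stdout) => {
      resolve(err ? null : String(stdout).trim() || null);
    });
  });
}

// Assumed redaction rule: drop any `user:password@` userinfo from URL-style
// remotes. scp-style remotes (git@host:path) have no scheme and pass through.
function redactRemote(remote) {
  return remote.replace(/^([a-z][a-z0-9+.-]*:\/\/)[^/@]+@/i, '$1');
}

// Recover git_remote and repo_root from a cwd. The `git` parameter is the
// injectable seam: tests pass a fake so no real repository is needed.
async function deriveRepoFromCwd(cwd, git = runGit) {
  if (!cwd) return {};
  const remote = await git(cwd, ['config', 'remote.origin.url']);
  if (!remote) return {}; // cwd gone or not a repo: recover nothing
  const root = await git(cwd, ['rev-parse', '--show-toplevel']);
  const out = { git_remote: redactRemote(remote) };
  if (root) out.repo_root = root;
  return out; // note: no head_sha is ever produced here
}
```

Passing a fake `git` function keeps tests hermetic: the test controls exactly what each git command "returns" and never touches the filesystem.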

Derives git_remote + repo_root only — never head_sha

Run at import time, rev-parse HEAD reports the repo's current HEAD, not the commit the historical session sat on, so a recovered sha would mint a wrong Commit node. The headline session↔repo join needs only the remote, and a repo's toplevel is stable across commits, so both are safe to derive after the fact; the sha is not. A cwd that no longer resolves to a git repo (deleted worktree, moved checkout) recovers nothing, and the session falls back to absolute File keys.
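
One way to picture the "remote and root only" rule is as an allowlist over whatever a derivation returns. The sketch below is hypothetical: `RECOVERABLE_FIELDS` and `pickRecoverable` are illustrative names, and per the description above the actual provider never probes HEAD during recovery in the first place rather than filtering it out afterwards.

```javascript
// Fields that stay valid when derived long after the session ended.
// head_sha is deliberately absent: it would describe today's HEAD.
const RECOVERABLE_FIELDS = ['git_remote', 'repo_root'];

// Keep only the allowlisted, non-empty fields from a derivation result.
function pickRecoverable(derived) {
  const out = {};
  for (const key of RECOVERABLE_FIELDS) {
    if (derived && derived[key]) out[key] = derived[key];
  }
  return out;
}
```

An empty or missing result yields an empty object, which corresponds to the "recovers nothing" case for a deleted worktree or moved checkout.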

Going-forward behaviour

This is a permanent code path, not a one-off data patch: future hyp backfill claude runs attribute newly-imported sessions from their cwd. (Live capture going forward still depends on a 0032-aware daemon actually running the hook — separate, already-merged in #125.)

One operational note, documented in the LLP: refreshing already-imported history requires dropping the old ai_gateway_messages rows first, because the materializer's pre-write part_id dedupe drops byte-identical re-imports.

Validation

  • npm test green; tsc --noEmit clean; ref-check clean.
  • New tests: claude-git-repo.test.js (5) + 3 in claude-backfill.test.js — record wins over derivation, cwd-fallback fills remote/root, head_sha never derived, per-cwd memoization, credential redaction, graceful degrade.
  • Real-corpus rehearsal over ~/.claude history: 694 / 774 on-disk sessions recovered a correct git_remote (was 0), resolving to hyparam/hypaware (incl. a live worktree), stele, hypaware-server, icebird, … After applying to the live cache + re-projecting, repo-linked enrichment items went 264 → 1,153 (19% → 84%) with the total item count unchanged (no re-enrichment).

🤖 Generated with Claude Code

Sessions recorded before LLP 0032's git capture landed have a
session-context sidecar (and transcript) with no remote/HEAD/root, so
their Session -in-> Repo edge never mints and their enrichment floats
unattributed. The one repo signal they carry is the cwd on each
transcript line, so the backfill now runs git in that cwd at import time
(git_repo.js deriveRepoFromCwd) when the record supplies no remote.

Derives git_remote + repo_root only, NEVER head_sha: rev-parse HEAD today
reports the repo's current HEAD, not the commit the session sat on, so a
recovered sha would mint a wrong Commit node. The headline session<->repo
join needs only the remote, and a toplevel is stable across commits.

- git_repo.js: deriveRepoFromCwd, redacts credential userinfo, injectable
  git seam for hermetic tests.
- backfill.js: resolve cwd as record.cwd ?? transcript cwd; derive when
  the record lacks a remote; per-cwd memoized so one git probe runs per
  distinct directory, not per session.
- transcripts.js/types.d.ts: parse cwd off each transcript line.
- LLP 0032 §capture: document backfill recovery + the part_id-dedup
  caveat (refreshing pre-capture history needs the old rows dropped first).

Validated against real repos (hyparam/hypaware incl. a live worktree,
hypaware-server) and a full rehearsal backfill of ~/.claude history:
694/774 on-disk sessions recovered a correct git_remote (was 0/1147).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK
@philcunliffe

Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up

Advisory only: no merge was attempted.

⚠️ Codex was unavailable this run (local gateway stream disconnect at 127.0.0.1:8787); the verdict is computed on the Claude review alone.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

| Source | Finding (severity, evidence) | Intersects |
| --- | --- | --- |
| Claude | Canonical record-present/no-remote recovery untested (major, backfill.js:336,351) | Targets (projectedExchangeFromEntries), Risks bullet 1 |
| Claude | record repo_root not guarded against derived overwrite (minor, backfill.js:354) | Risks bullet 3 (cwd-reuse / derived repo_root) |

  • Codex review: (no findings reported)
Claude review

Five independent lenses ran (guidance compliance, shallow bug scan, historical
context, contract & callers, comments & tests). Four found nothing material:
code style/JSDoc/@ref rules all satisfied and every anchor resolves; the
sync→async change to projectedExchangeFromEntries has its one (module-private)
caller awaiting it; the new required cwd field on TranscriptEntry is set by
the only constructor (transcriptEntryFromRow) and tsc --noEmit is clean; the
per-cwd Promise memo is correctly wired (one git probe per distinct cwd, caches
the in-flight Promise); git_repo.js faithfully mirrors the live hook's
gitLine/redaction with a deliberate 2000ms timeout and no head_sha; and the
LLP 0032 part_id-dedupe caveat is accurate. Two test-coverage gaps survived the
≥80 filter.
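
The per-cwd Promise memo mentioned in the review can be illustrated with a small wrapper. This is a sketch under assumptions, not the wiring in backfill.js: `memoizeByCwd` is a hypothetical name, and caching rejected Promises alongside resolved ones is a property of this sketch rather than a confirmed behaviour of the provider.

```javascript
// Wrap an async derive(cwd) so each distinct cwd is probed at most once.
// The Map stores the in-flight Promise itself, so concurrent callers for the
// same cwd share one probe instead of racing to start their own.
function memoizeByCwd(derive) {
  const inFlight = new Map();
  return function deriveOnce(cwd) {
    let pending = inFlight.get(cwd);
    if (!pending) {
      pending = Promise.resolve().then(() => derive(cwd));
      inFlight.set(cwd, pending);
    }
    return pending;
  };
}
```

Storing the Promise rather than the resolved value is the important detail: if the cache were only filled after the probe finished, a backfill that starts many sessions concurrently would still launch one probe per session for the same directory.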

Canonical pre-0032 recovery (record present, cwd, no remote) is untested

  • Severity: major
  • Confidence: 88
  • Evidence: test/plugins/claude-backfill.test.js:622 (only recovery test uses NO record at all); hypaware-core/plugins-workspace/claude/src/backfill.js:336 (const cwd = record?.cwd ?? transcriptCwd) and :351 (derivation guarded on that cwd)
  • Why it matters: LLP 0032 defines the primary recovery target as a session whose sidecar record EXISTS but carries cwd and no remote/HEAD/root; the new tests only cover "no record at all → derive from transcriptCwd", so the headline branch — derive git_remote from record.cwd while the record's other fields are present — has zero coverage, and a future change to the record?.cwd ?? transcriptCwd precedence or the derivation guard would pass all tests while silently breaking pre-0032 recovery.
  • Suggested fix: Add a backfill test that writes a session-context record with cwd/git_branch but no git_remote/head_sha/repo_root, asserting (i) deriveRepo is called with record.cwd (not the transcript cwd) and (ii) the recovered git_remote lands on the exchange.

No test guards "record repo_root is not overwritten by derived repo_root"

  • Severity: minor
  • Confidence: 82
  • Evidence: hypaware-core/plugins-workspace/claude/src/backfill.js:354 (if (derived.repo_root && !exchange.repo_root) exchange.repo_root = derived.repo_root)
  • Why it matters: The && !exchange.repo_root guard preserves the record's authoritative toplevel when only the remote is recovered; dropping it would let a derived (possibly worktree-shifted) repo_root clobber the record's value, and no test would fail — the "record wins" test short-circuits derivation entirely because the record already has a remote.
  • Suggested fix: Add/extend a test where the record supplies repo_root but no git_remote and deriveRepo returns a remote plus a DIFFERENT repo_root; assert the derived remote lands while exchange.repo_root keeps the record's value.

Reports: /Users/phil/workspace/hypaware/.git/worktrees/dual-review-pr131/dual-review/pr-131

Address two test-coverage gaps from the dual-review of #131. Both are
test-only; the production logic is correct as written.

1. (major) The canonical pre-0032 shape — a session-context record that
   EXISTS with a cwd but no git_remote/head_sha/repo_root — had no test;
   the existing recovery test uses no record at all, so it only exercised
   the transcriptCwd fallback, not the `record?.cwd ?? transcriptCwd`
   precedence (backfill.js:336) the headline scenario needs. The new test
   makes the record's cwd differ from the transcript line's cwd and
   asserts deriveRepo is called with the record's cwd and the recovered
   git_remote lands.

2. (minor) The `&& !exchange.repo_root` guard (backfill.js:354) — a
   record-supplied repo_root must not be clobbered by a derived one — had
   no test. The new test supplies repo_root but no git_remote, returns a
   different repo_root from deriveRepo, and asserts the derived remote
   lands while the record's repo_root is preserved.

Verified by mutation: flipping the cwd precedence fails (1); dropping the
repo_root guard fails (2). Full suite green (1302 pass), tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe philcunliffe merged commit 6704bac into master Jun 22, 2026
6 checks passed
@philcunliffe philcunliffe deleted the claude-backfill-cwd-repo-recovery branch June 22, 2026 18:40