Skip to content

context-graph-enrich: bounded --since curate + crash-resumable Batch job#130

Merged
philcunliffe merged 2 commits into
masterfrom
enrich-backfill-since-window
Jun 22, 2026
Merged

context-graph-enrich: bounded --since curate + crash-resumable Batch job#130
philcunliffe merged 2 commits into
masterfrom
enrich-backfill-since-window

Conversation

@philcunliffe

Copy link
Copy Markdown
Contributor

Two operability fixes for the hyp enrich backfill curate path, both aimed at making a deliberate frontier-curate run safe to drive by hand. Landed while doing exactly that — a scoped 2-week curate over the live cache (496 prospects → 335 committed / 15 merged / 161 rejected, then projected into the graph).

--since / --dry-run (bounded curate window)

An unbounded cold-backfill pool is intractable: curate does a per-prospect vector recall plus greedy O(n²) cosine clustering, so thousands of pending prospects peg CPU and never submit (an earlier ~2,800-prospect attempt ran 42 min and never submitted).

  • --since <YYYY-MM-DD> scopes the pending pool to sessions active on or after that day (a few hundred prospects for a two-week slice).
  • --dry-run reports the scoped pool + cluster count without submitting.
  • The bound is a read-side filter on selectPending, not a mutation: out-of-window prospects stay pending for a later, separately-scoped run rather than being deleted or skip-drained.
  • Filter is config-driven (anchor_key_column / timestamp_column), so it stays source-agnostic.

Crash-resumable Batch job (persist-and-resume)

runCurateBatch now persists the in-flight job (batch id + cluster→prospect map) to state.curate_job on submit and clears it after results are committed, mirroring the daemon submit-and-collect path. A crash between submit and collect is now recoverable: re-running resumes the same batch instead of re-submitting (and re-billing) it.

Tests / docs

  • Parse-level tests for the new flags (valid/=-form/malformed/missing-value/--propose-only conflict).
  • LLP 0028 (two-regimes) documents the bounded-window lever.
  • npm test: 1298 pass; tsc --noEmit clean.

🤖 Generated with Claude Code

https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK

Two operability fixes for the `hyp enrich backfill` curate path, both aimed at
making a deliberate frontier-curate run safe to drive by hand.

--since / --dry-run (bounded curate window)
  An unbounded cold-backfill pool is intractable: curate does a per-prospect
  vector recall plus greedy O(n^2) cosine clustering, so thousands of pending
  prospects peg CPU and never submit. `--since <YYYY-MM-DD>` scopes the pending
  pool to sessions active on or after that day (a few hundred prospects for a
  two-week slice), and `--dry-run` reports the scoped pool + cluster count
  without submitting. The bound is a read-side filter on selectPending, NOT a
  mutation: out-of-window prospects stay pending for a later, separately-scoped
  run rather than being deleted or skip-drained. Filter is config-driven
  (anchor_key_column / timestamp_column), so it stays source-agnostic.

Crash-resumable Batch job (persist-and-resume)
  runCurateBatch now persists the in-flight job (batch id + cluster->prospect
  map) to state.curate_job on submit and clears it after results are committed,
  mirroring the daemon submit-and-collect path. A crash between submit and
  collect is now recoverable: re-running resumes the same batch instead of
  re-submitting (and re-billing) it.

Adds parse-level tests for the new flags; updates LLP 0028 (two-regimes) to
document the bounded-window lever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Dual-agent review — request_changes

  • Verdict: request_changes
  • Risk class: medium
  • Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up
  • Reviewers: Claude (3 findings: 1 major, 2 minor); Codex produced no output (failed 3× on a local proxy 127.0.0.1:8787 stream disconnect — infra, not a review signal).

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source Finding (severity, evidence) Intersects
Claude Backfill/daemon share un-namespaced curate_job (major, batch.js:69-74,97) Concurrency surface + Risk #1
Claude Resume reports pending: 0 → misleading N/0 (minor, batch.js:73) Risk #3 (return shape)
Claude New branches untested (minor, batch.test.js / inWindowSessions) Risk #2 (coverage)
Codex (no output — codex failed 3× on local proxy 127.0.0.1:8787 stream disconnect)
  • Codex review: (no findings reported)
Claude review

Claude review

Backfill resume and the daemon share one un-namespaced curate_job slot

  • Severity: major
  • Confidence: 82
  • Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:69-74 (and the persist at :97), vs the daemon path at :138-172 / :319-323
  • Why it matters: This PR makes the backfill runCurateBatch both write (:97) and resume/clear (:69-74, :114) the same single state.curate_job slot the daemon's ongoing regime already owns (submitCurateJob/collectCurateJob/tick). The resume branch keys off any in-flight job with no source discriminator, so with the daemon curate source enabled (curate.enabled) a manual hyp enrich backfill --since … will (a) silently ignore --since — if a daemon job is in flight it collects that job and returns at :73 without ever submitting the scoped batch, after already printing "scoped to --since … → N in-window session(s)"; and (b) let the daemon tick() collect/clear the backfill's job (or vice-versa) — a concurrent double-collect. It is not data-loss or double-billing (whichever path collects commits the results, and downstream dedups), but it is a real, newly-introduced cross-regime surprise on a graph-mutating, money-adjacent operation. Before this PR runCurateBatch neither read nor wrote curate_job, so the collision surface is new.
  • Suggested fix: Tag the persisted job with its origin (e.g. curate_job.source: 'backfill' | 'daemon') and only resume in runCurateBatch when inflight.source === 'backfill'; or, minimally, document and guard that a manual backfill curate must not run while the daemon curate source is enabled.

Resume path reports pending: 0, yielding a misleading N/0 processed line

  • Severity: minor
  • Confidence: 85
  • Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:73; printed at commands.js:101-103
  • Why it matters: On the crash-recovery resume branch the return hardcodes pending: 0, so the caller prints ${cr.processed}/${cr.pending} processed as e.g. 42/0 processed over 7 cluster(s)processed exceeding pending reads as nonsense on exactly the rare path (post-crash recovery) where the operator most needs a trustworthy summary. The actual enrichment is correct; only the reported count is wrong.
  • Suggested fix: On resume either recompute the scoped pending count (selectPending(runtime, { anchorKeys: opts.anchorKeys })) before returning, or have the caller special-case the resumed-job message so it doesn't divide by a placeholder zero.

New curate branches (resume entry, dry-run, --since scoping) have no direct tests

  • Severity: minor
  • Confidence: 80
  • Evidence: test/plugins/context-graph-enrich-batch.test.js (runCurateBatch tests at :149/:174 never pre-persist curate_job, set dryRun, or pass anchorKeys); no test exercises inWindowSessions (commands.js:119-134) or selectPending's anchorKeys filter (curate.js:198-210)
  • Why it matters: The daemon submit→collect→clear machinery is covered, so the resume mechanism is transitively exercised — but the backfill-specific deltas this PR adds are deterministic, locally-testable logic (CLAUDE.md: "add traditional tests for … config parsing and validation … and similar local contracts") and the riskiest/money-touching one (resume-instead-of-resubmit) is asserted nowhere directly. The dry-run "writes nothing" claim and the inWindowSessions UTC-midnight date boundary are easy, high-value asserts left uncovered.
  • Suggested fix: Add three small unit tests: (1) runCurateBatch with a pre-persisted curate_job resumes (polls + collects, no second batch.submit); (2) --dry-run returns the pool/cluster count and performs no updateState/append; (3) inWindowSessions includes a session whose latest ts is exactly the --since UTC midnight and excludes one a second before.

Reports: /Users/phil/workspace/hypaware/.git/dual-review/pr-130

Address PR #130 dual-review findings.

[major] Backfill and the daemon shared one un-namespaced curate_job slot,
so a manual `hyp enrich backfill --since` and an enabled daemon curate
source could collect/clear or clobber each other's batch — silently
dropping the --since scope or orphaning an already-billed batch.

Tag CurateJob with `source: 'backfill' | 'daemon'` and enforce ownership:
each driver resumes/collects/clears only its own job. runCurateBatch
resumes just its own backfill job and refuses (rather than overwriting the
single slot) when a daemon job is in flight; collectCurateJob gains an
`owner` param and leaves a foreign job untouched (phase 'foreign'), so a
daemon tick no longer collects a backfill job and vice-versa. A legacy job
with no source reads as `daemon`, its original owner.

[minor] The resume branch hardcoded `pending: 0`, so the caller printed a
misleading `N/0 processed`. Recompute the scoped pending count via
selectPending before collecting.

[minor] Add direct tests for the new branches: resume-from-persisted (no
re-submit), refusal on a daemon job, --dry-run writes-nothing, --since
anchorKeys scoping, daemon-leaves-backfill-job, inWindowSessions
UTC-midnight boundary, selectPending anchorKeys filter, and legacy/unknown
source coercion.

Document the shared-slot ownership decision in LLP 0028 #two-regimes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@philcunliffe

Copy link
Copy Markdown
Contributor Author

Addressed the dual-review findings (commit 292d251).

[major] Un-namespaced curate_job slot. Tagged CurateJob with source: 'backfill' | 'daemon' and enforced ownership — each driver resumes/collects/clears only its own job:

  • runCurateBatch resumes just its own backfill job; when a daemon job is in flight it refuses with a clear error rather than overwriting the single slot (which would orphan the daemon's already-billed batch). No more silent --since drop.
  • collectCurateJob gained an owner param (default daemon); a foreign job returns phase: 'foreign' untouched, so a daemon tick leaves a backfill job alone and vice-versa.
  • A legacy sidecar job with no source reads as daemon (its original owner) — back-compatible.

Residual: a narrow submit-window TOCTOU remains (backfill starts with an empty slot, daemon submits during backfill's build/submit window). Worst case is one orphaned daemon batch whose prospects stay pending and re-submit on a later tick — bounded wasted spend, no data loss. Fully closing it needs a cross-process lock, which isn't how this plugin coordinates.

[minor] Resume reported pending: 0 → misleading N/0 processed. The resume branch now recomputes the scoped pool via selectPending(runtime, { anchorKeys }) before collecting.

[minor] New branches untested. Added 12 direct tests: resume-from-persisted (no re-submit), refusal on a daemon job, --dry-run writes-nothing, --since anchorKeys scoping, daemon-leaves-backfill-job, inWindowSessions UTC-midnight boundary + timestamp coercion, selectPending anchorKeys filter, and legacy/unknown source coercion.

Documented the shared-slot ownership decision in LLP 0028 #two-regimes (same commit) and added one @ref.

Verification: npm test1310 pass, 0 fail; tsc --noEmit clean; ref-check → 34 refs, 0 broken.

@philcunliffe philcunliffe merged commit cbc9cb7 into master Jun 22, 2026
6 checks passed
@philcunliffe philcunliffe deleted the enrich-backfill-since-window branch June 22, 2026 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant