context-graph-enrich: bounded --since curate + crash-resumable Batch job#130
Conversation
Two operability fixes for the `hyp enrich backfill` curate path, both aimed at making a deliberate frontier-curate run safe to drive by hand. --since / --dry-run (bounded curate window) An unbounded cold-backfill pool is intractable: curate does a per-prospect vector recall plus greedy O(n^2) cosine clustering, so thousands of pending prospects peg CPU and never submit. `--since <YYYY-MM-DD>` scopes the pending pool to sessions active on or after that day (a few hundred prospects for a two-week slice), and `--dry-run` reports the scoped pool + cluster count without submitting. The bound is a read-side filter on selectPending, NOT a mutation: out-of-window prospects stay pending for a later, separately-scoped run rather than being deleted or skip-drained. Filter is config-driven (anchor_key_column / timestamp_column), so it stays source-agnostic. Crash-resumable Batch job (persist-and-resume) runCurateBatch now persists the in-flight job (batch id + cluster->prospect map) to state.curate_job on submit and clears it after results are committed, mirroring the daemon submit-and-collect path. A crash between submit and collect is now recoverable: re-running resumes the same batch instead of re-submitting (and re-billing) it. Adds parse-level tests for the new flags; updates LLP 0028 (two-regimes) to document the bounded-window lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK
Dual-agent review —
|
| Source | Finding (severity, evidence) | Intersects |
|---|---|---|
| Claude | Backfill/daemon share un-namespaced curate_job (major, batch.js:69-74,97) |
Concurrency surface + Risk #1 |
| Claude | Resume reports pending: 0 → misleading N/0 (minor, batch.js:73) |
Risk #3 (return shape) |
| Claude | New branches untested (minor, batch.test.js / inWindowSessions) | Risk #2 (coverage) |
| Codex | (no output — codex failed 3× on local proxy 127.0.0.1:8787 stream disconnect) |
— |
- Codex review: (no findings reported)
Claude review
Claude review
Backfill resume and the daemon share one un-namespaced curate_job slot
- Severity: major
- Confidence: 82
- Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:69-74 (and the persist at :97), vs the daemon path at :138-172 / :319-323
- Why it matters: This PR makes the backfill
runCurateBatchboth write (:97) and resume/clear (:69-74,:114) the same singlestate.curate_jobslot the daemon's ongoing regime already owns (submitCurateJob/collectCurateJob/tick). The resume branch keys off any in-flight job with no source discriminator, so with the daemon curate source enabled (curate.enabled) a manualhyp enrich backfill --since …will (a) silently ignore--since— if a daemon job is in flight it collects that job and returns at:73without ever submitting the scoped batch, after already printing "scoped to --since … → N in-window session(s)"; and (b) let the daemontick()collect/clear the backfill's job (or vice-versa) — a concurrent double-collect. It is not data-loss or double-billing (whichever path collects commits the results, and downstream dedups), but it is a real, newly-introduced cross-regime surprise on a graph-mutating, money-adjacent operation. Before this PRrunCurateBatchneither read nor wrotecurate_job, so the collision surface is new. - Suggested fix: Tag the persisted job with its origin (e.g.
curate_job.source: 'backfill' | 'daemon') and only resume inrunCurateBatchwheninflight.source === 'backfill'; or, minimally, document and guard that a manual backfill curate must not run while the daemon curate source is enabled.
Resume path reports pending: 0, yielding a misleading N/0 processed line
- Severity: minor
- Confidence: 85
- Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:73; printed at commands.js:101-103
- Why it matters: On the crash-recovery resume branch the return hardcodes
pending: 0, so the caller prints${cr.processed}/${cr.pending} processedas e.g.42/0 processed over 7 cluster(s)—processedexceedingpendingreads as nonsense on exactly the rare path (post-crash recovery) where the operator most needs a trustworthy summary. The actual enrichment is correct; only the reported count is wrong. - Suggested fix: On resume either recompute the scoped pending count (
selectPending(runtime, { anchorKeys: opts.anchorKeys })) before returning, or have the caller special-case the resumed-job message so it doesn't divide by a placeholder zero.
New curate branches (resume entry, dry-run, --since scoping) have no direct tests
- Severity: minor
- Confidence: 80
- Evidence: test/plugins/context-graph-enrich-batch.test.js (runCurateBatch tests at :149/:174 never pre-persist
curate_job, setdryRun, or passanchorKeys); no test exercisesinWindowSessions(commands.js:119-134) orselectPending'sanchorKeysfilter (curate.js:198-210) - Why it matters: The daemon submit→collect→clear machinery is covered, so the resume mechanism is transitively exercised — but the backfill-specific deltas this PR adds are deterministic, locally-testable logic (CLAUDE.md: "add traditional tests for … config parsing and validation … and similar local contracts") and the riskiest/money-touching one (resume-instead-of-resubmit) is asserted nowhere directly. The dry-run "writes nothing" claim and the
inWindowSessionsUTC-midnight date boundary are easy, high-value asserts left uncovered. - Suggested fix: Add three small unit tests: (1)
runCurateBatchwith a pre-persistedcurate_jobresumes (polls + collects, no secondbatch.submit); (2)--dry-runreturns the pool/cluster count and performs noupdateState/append; (3)inWindowSessionsincludes a session whose latesttsis exactly the--sinceUTC midnight and excludes one a second before.
Reports: /Users/phil/workspace/hypaware/.git/dual-review/pr-130
Address PR #130 dual-review findings. [major] Backfill and the daemon shared one un-namespaced curate_job slot, so a manual `hyp enrich backfill --since` and an enabled daemon curate source could collect/clear or clobber each other's batch — silently dropping the --since scope or orphaning an already-billed batch. Tag CurateJob with `source: 'backfill' | 'daemon'` and enforce ownership: each driver resumes/collects/clears only its own job. runCurateBatch resumes just its own backfill job and refuses (rather than overwriting the single slot) when a daemon job is in flight; collectCurateJob gains an `owner` param and leaves a foreign job untouched (phase 'foreign'), so a daemon tick no longer collects a backfill job and vice-versa. A legacy job with no source reads as `daemon`, its original owner. [minor] The resume branch hardcoded `pending: 0`, so the caller printed a misleading `N/0 processed`. Recompute the scoped pending count via selectPending before collecting. [minor] Add direct tests for the new branches: resume-from-persisted (no re-submit), refusal on a daemon job, --dry-run writes-nothing, --since anchorKeys scoping, daemon-leaves-backfill-job, inWindowSessions UTC-midnight boundary, selectPending anchorKeys filter, and legacy/unknown source coercion. Document the shared-slot ownership decision in LLP 0028 #two-regimes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Addressed the dual-review findings (commit 292d251). [major] Un-namespaced
Residual: a narrow submit-window TOCTOU remains (backfill starts with an empty slot, daemon submits during backfill's build/submit window). Worst case is one orphaned daemon batch whose prospects stay pending and re-submit on a later tick — bounded wasted spend, no data loss. Fully closing it needs a cross-process lock, which isn't how this plugin coordinates. [minor] Resume reported [minor] New branches untested. Added 12 direct tests: resume-from-persisted (no re-submit), refusal on a daemon job, Documented the shared-slot ownership decision in LLP 0028 Verification: |
Two operability fixes for the
hyp enrich backfillcurate path, both aimed at making a deliberate frontier-curate run safe to drive by hand. Landed while doing exactly that — a scoped 2-week curate over the live cache (496 prospects → 335 committed / 15 merged / 161 rejected, then projected into the graph).--since/--dry-run(bounded curate window)An unbounded cold-backfill pool is intractable: curate does a per-prospect vector recall plus greedy O(n²) cosine clustering, so thousands of pending prospects peg CPU and never submit (an earlier ~2,800-prospect attempt ran 42 min and never submitted).
--since <YYYY-MM-DD>scopes the pending pool to sessions active on or after that day (a few hundred prospects for a two-week slice).--dry-runreports the scoped pool + cluster count without submitting.selectPending, not a mutation: out-of-window prospects stay pending for a later, separately-scoped run rather than being deleted orskip-drained.anchor_key_column/timestamp_column), so it stays source-agnostic.Crash-resumable Batch job (persist-and-resume)
runCurateBatchnow persists the in-flight job (batch id + cluster→prospect map) tostate.curate_jobon submit and clears it after results are committed, mirroring the daemon submit-and-collect path. A crash between submit and collect is now recoverable: re-running resumes the same batch instead of re-submitting (and re-billing) it.Tests / docs
=-form/malformed/missing-value/--propose-onlyconflict).npm test: 1298 pass;tsc --noEmitclean.🤖 Generated with Claude Code
https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK