context-graph-enrich: bounded --since curate + crash-resumable Batch job by philcunliffe · Pull Request #130 · hyparam/hypaware

philcunliffe · 2026-06-22T16:43:31Z

Two operability fixes for the hyp enrich backfill curate path, both aimed at making a deliberate frontier-curate run safe to drive by hand. Landed while doing exactly that — a scoped 2-week curate over the live cache (496 prospects → 335 committed / 15 merged / 161 rejected, then projected into the graph).

`--since` / `--dry-run` (bounded curate window)

An unbounded cold-backfill pool is intractable: curate does a per-prospect vector recall plus greedy O(n²) cosine clustering, so thousands of pending prospects peg CPU and never submit (an earlier ~2,800-prospect attempt ran 42 min and never submitted).

--since <YYYY-MM-DD> scopes the pending pool to sessions active on or after that day (a few hundred prospects for a two-week slice).
--dry-run reports the scoped pool + cluster count without submitting.
The bound is a read-side filter on selectPending, not a mutation: out-of-window prospects stay pending for a later, separately-scoped run rather than being deleted or skip-drained.
Filter is config-driven (anchor_key_column / timestamp_column), so it stays source-agnostic.

Crash-resumable Batch job (persist-and-resume)

runCurateBatch now persists the in-flight job (batch id + cluster→prospect map) to state.curate_job on submit and clears it after results are committed, mirroring the daemon submit-and-collect path. A crash between submit and collect is now recoverable: re-running resumes the same batch instead of re-submitting (and re-billing) it.

Tests / docs

Parse-level tests for the new flags (valid/=-form/malformed/missing-value/--propose-only conflict).
LLP 0028 (two-regimes) documents the bounded-window lever.
npm test: 1298 pass; tsc --noEmit clean.

🤖 Generated with Claude Code

https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK

Two operability fixes for the `hyp enrich backfill` curate path, both aimed at making a deliberate frontier-curate run safe to drive by hand. --since / --dry-run (bounded curate window) An unbounded cold-backfill pool is intractable: curate does a per-prospect vector recall plus greedy O(n^2) cosine clustering, so thousands of pending prospects peg CPU and never submit. `--since <YYYY-MM-DD>` scopes the pending pool to sessions active on or after that day (a few hundred prospects for a two-week slice), and `--dry-run` reports the scoped pool + cluster count without submitting. The bound is a read-side filter on selectPending, NOT a mutation: out-of-window prospects stay pending for a later, separately-scoped run rather than being deleted or skip-drained. Filter is config-driven (anchor_key_column / timestamp_column), so it stays source-agnostic. Crash-resumable Batch job (persist-and-resume) runCurateBatch now persists the in-flight job (batch id + cluster->prospect map) to state.curate_job on submit and clears it after results are committed, mirroring the daemon submit-and-collect path. A crash between submit and collect is now recoverable: re-running resumes the same batch instead of re-submitting (and re-billing) it. Adds parse-level tests for the new flags; updates LLP 0028 (two-regimes) to document the bounded-window lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWrfrpzfh9isrWV7umZLzK

philcunliffe · 2026-06-22T18:04:07Z

Dual-agent review — `request_changes`

Verdict: request_changes
Risk class: medium
Auto-merge advisory: 👎 thumbs down — verdict is request_changes; needs human-gated follow-up
Reviewers: Claude (3 findings: 1 major, 2 minor); Codex produced no output (failed 3× on a local proxy 127.0.0.1:8787 stream disconnect — infra, not a review signal).

Advisory only: no merge was attempted.

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Source	Finding (severity, evidence)	Intersects
Claude	Backfill/daemon share un-namespaced `curate_job` (major, batch.js:69-74,97)	Concurrency surface + Risk #1
Claude	Resume reports `pending: 0` → misleading `N/0` (minor, batch.js:73)	Risk #3 (return shape)
Claude	New branches untested (minor, batch.test.js / inWindowSessions)	Risk #2 (coverage)
Codex	(no output — codex failed 3× on local proxy `127.0.0.1:8787` stream disconnect)	—

Codex review: (no findings reported)

Claude review

Claude review

Backfill resume and the daemon share one un-namespaced `curate_job` slot

Severity: major
Confidence: 82
Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:69-74 (and the persist at :97), vs the daemon path at :138-172 / :319-323
Why it matters: This PR makes the backfill runCurateBatch both write (:97) and resume/clear (:69-74, :114) the same single state.curate_job slot the daemon's ongoing regime already owns (submitCurateJob/collectCurateJob/tick). The resume branch keys off any in-flight job with no source discriminator, so with the daemon curate source enabled (curate.enabled) a manual hyp enrich backfill --since … will (a) silently ignore --since — if a daemon job is in flight it collects that job and returns at :73 without ever submitting the scoped batch, after already printing "scoped to --since … → N in-window session(s)"; and (b) let the daemon tick() collect/clear the backfill's job (or vice-versa) — a concurrent double-collect. It is not data-loss or double-billing (whichever path collects commits the results, and downstream dedups), but it is a real, newly-introduced cross-regime surprise on a graph-mutating, money-adjacent operation. Before this PR runCurateBatch neither read nor wrote curate_job, so the collision surface is new.
Suggested fix: Tag the persisted job with its origin (e.g. curate_job.source: 'backfill' | 'daemon') and only resume in runCurateBatch when inflight.source === 'backfill'; or, minimally, document and guard that a manual backfill curate must not run while the daemon curate source is enabled.

Resume path reports `pending: 0`, yielding a misleading `N/0 processed` line

Severity: minor
Confidence: 85
Evidence: hypaware-core/plugins-workspace/context-graph-enrich/src/batch.js:73; printed at commands.js:101-103
Why it matters: On the crash-recovery resume branch the return hardcodes pending: 0, so the caller prints ${cr.processed}/${cr.pending} processed as e.g. 42/0 processed over 7 cluster(s) — processed exceeding pending reads as nonsense on exactly the rare path (post-crash recovery) where the operator most needs a trustworthy summary. The actual enrichment is correct; only the reported count is wrong.
Suggested fix: On resume either recompute the scoped pending count (selectPending(runtime, { anchorKeys: opts.anchorKeys })) before returning, or have the caller special-case the resumed-job message so it doesn't divide by a placeholder zero.

New curate branches (resume entry, dry-run, `--since` scoping) have no direct tests

Severity: minor
Confidence: 80
Evidence: test/plugins/context-graph-enrich-batch.test.js (runCurateBatch tests at :149/:174 never pre-persist curate_job, set dryRun, or pass anchorKeys); no test exercises inWindowSessions (commands.js:119-134) or selectPending's anchorKeys filter (curate.js:198-210)
Why it matters: The daemon submit→collect→clear machinery is covered, so the resume mechanism is transitively exercised — but the backfill-specific deltas this PR adds are deterministic, locally-testable logic (CLAUDE.md: "add traditional tests for … config parsing and validation … and similar local contracts") and the riskiest/money-touching one (resume-instead-of-resubmit) is asserted nowhere directly. The dry-run "writes nothing" claim and the inWindowSessions UTC-midnight date boundary are easy, high-value asserts left uncovered.
Suggested fix: Add three small unit tests: (1) runCurateBatch with a pre-persisted curate_job resumes (polls + collects, no second batch.submit); (2) --dry-run returns the pool/cluster count and performs no updateState/append; (3) inWindowSessions includes a session whose latest ts is exactly the --since UTC midnight and excludes one a second before.

Reports: /Users/phil/workspace/hypaware/.git/dual-review/pr-130

Address PR #130 dual-review findings. [major] Backfill and the daemon shared one un-namespaced curate_job slot, so a manual `hyp enrich backfill --since` and an enabled daemon curate source could collect/clear or clobber each other's batch — silently dropping the --since scope or orphaning an already-billed batch. Tag CurateJob with `source: 'backfill' | 'daemon'` and enforce ownership: each driver resumes/collects/clears only its own job. runCurateBatch resumes just its own backfill job and refuses (rather than overwriting the single slot) when a daemon job is in flight; collectCurateJob gains an `owner` param and leaves a foreign job untouched (phase 'foreign'), so a daemon tick no longer collects a backfill job and vice-versa. A legacy job with no source reads as `daemon`, its original owner. [minor] The resume branch hardcoded `pending: 0`, so the caller printed a misleading `N/0 processed`. Recompute the scoped pending count via selectPending before collecting. [minor] Add direct tests for the new branches: resume-from-persisted (no re-submit), refusal on a daemon job, --dry-run writes-nothing, --since anchorKeys scoping, daemon-leaves-backfill-job, inWindowSessions UTC-midnight boundary, selectPending anchorKeys filter, and legacy/unknown source coercion. Document the shared-slot ownership decision in LLP 0028 #two-regimes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

philcunliffe · 2026-06-22T18:34:41Z

Addressed the dual-review findings (commit 292d251).

[major] Un-namespaced curate_job slot. Tagged CurateJob with source: 'backfill' | 'daemon' and enforced ownership — each driver resumes/collects/clears only its own job:

runCurateBatch resumes just its own backfill job; when a daemon job is in flight it refuses with a clear error rather than overwriting the single slot (which would orphan the daemon's already-billed batch). No more silent --since drop.
collectCurateJob gained an owner param (default daemon); a foreign job returns phase: 'foreign' untouched, so a daemon tick leaves a backfill job alone and vice-versa.
A legacy sidecar job with no source reads as daemon (its original owner) — back-compatible.

Residual: a narrow submit-window TOCTOU remains (backfill starts with an empty slot, daemon submits during backfill's build/submit window). Worst case is one orphaned daemon batch whose prospects stay pending and re-submit on a later tick — bounded wasted spend, no data loss. Fully closing it needs a cross-process lock, which isn't how this plugin coordinates.

[minor] Resume reported pending: 0 → misleading N/0 processed. The resume branch now recomputes the scoped pool via selectPending(runtime, { anchorKeys }) before collecting.

[minor] New branches untested. Added 12 direct tests: resume-from-persisted (no re-submit), refusal on a daemon job, --dry-run writes-nothing, --since anchorKeys scoping, daemon-leaves-backfill-job, inWindowSessions UTC-midnight boundary + timestamp coercion, selectPending anchorKeys filter, and legacy/unknown source coercion.

Documented the shared-slot ownership decision in LLP 0028 #two-regimes (same commit) and added one @ref.

Verification: npm test → 1310 pass, 0 fail; tsc --noEmit clean; ref-check → 34 refs, 0 broken.

philcunliffe merged commit cbc9cb7 into master Jun 22, 2026
6 checks passed

philcunliffe deleted the enrich-backfill-since-window branch June 22, 2026 18:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

context-graph-enrich: bounded --since curate + crash-resumable Batch job#130

context-graph-enrich: bounded --since curate + crash-resumable Batch job#130
philcunliffe merged 2 commits into
masterfrom
enrich-backfill-since-window

philcunliffe commented Jun 22, 2026

Uh oh!

philcunliffe commented Jun 22, 2026

Claude review

Backfill resume and the daemon share one un-namespaced `curate_job` slot

Resume path reports `pending: 0`, yielding a misleading `N/0 processed` line

New curate branches (resume entry, dry-run, `--since` scoping) have no direct tests

Uh oh!

philcunliffe commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

philcunliffe commented Jun 22, 2026

--since / --dry-run (bounded curate window)

Crash-resumable Batch job (persist-and-resume)

Tests / docs

Uh oh!

philcunliffe commented Jun 22, 2026

Dual-agent review — request_changes

Risk capstone

Cross-reference: reviewer findings vs high-risk surfaces

Claude review

Backfill resume and the daemon share one un-namespaced curate_job slot

Resume path reports pending: 0, yielding a misleading N/0 processed line

New curate branches (resume entry, dry-run, --since scoping) have no direct tests

Uh oh!

philcunliffe commented Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`--since` / `--dry-run` (bounded curate window)

Dual-agent review — `request_changes`

Backfill resume and the daemon share one un-namespaced `curate_job` slot

Resume path reports `pending: 0`, yielding a misleading `N/0 processed` line

New curate branches (resume entry, dry-run, `--since` scoping) have no direct tests