Skip to content

fix(sweep): merge grounding onto survivor on dedup collapse (#56)#57

Open
jimador wants to merge 8 commits into
mainfrom
feat/56-dedup-grounding-merge
Open

fix(sweep): merge grounding onto survivor on dedup collapse (#56)#57
jimador wants to merge 8 commits into
mainfrom
feat/56-dedup-grounding-merge

Conversation

@jimador

@jimador jimador commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Closes #56.

The dream-loop dedup sweep collapsed near-duplicate propositions by flipping the
losers to STALE with a pure status change. STALE is excluded from retrieval, so
each loser's grounding and provenance became invisible the moment it was retired
— evidence silently lost on every collapse. The ingestion-time reviser already
folds grounding onto the survivor; the sweep was the gap.

What changed

Fold the loser's evidence onto the survivor before retiring it. Policy decides
intent, the runner does the IO (it is the only layer with a repository handle),
keeping the existing SPI split:

  • SweepAction.MergeInto(survivorId, thenStatus) — new collapse outcome.
  • MergingSweepPolicy — reads MarkReason.Duplicate.survivorId, returns
    MergeInto; falls back to a plain STALE transition when pinned, unmarked, or
    when no usable survivor id is present. StatusTransitionSweepPolicy is
    untouched.
  • DefaultCollectorRunner handles MergeInto: loads the survivor, folds the
    loser's grounding + provenance + sourceIds onto it (append + distinct via the
    existing Proposition helpers), bumps reinforceCount by 1 to match the
    reviser's per-merge convention, saves the survivor, then retires the loser and
    emits the same status-changed event the plain path emits. A missing survivor
    falls back to a plain retirement (logged at WARN, since the audit row looks
    like an ordinary transition). Dry runs mutate nothing.
  • Opt in via CollectorRunner.Builder.withDuplicateDetection(), which pairs a
    DuplicateCollectorStrategy with MergingSweepPolicy. It defers to a policy
    set explicitly via withPolicy(), and the global builder default stays
    StatusTransitionSweepPolicy, so the decay/stale path is unchanged.

Review notes

Adversarial review raised the production save() path: findOrPersist dedups
by (contextId, text), so an in-place update could in theory be hijacked by a
same-text sibling. It's unreachable in practice — save() dedups identical text
at creation, so two same-text nodes can't coexist, and near-duplicate collapse
operates on different-text propositions whose exact text never collides. No
hot-path guard was added.

Caveat for adopters: save()'s dedup contract assumes a (contextId, text)
uniqueness constraint as a cross-instance backstop (declared in the test
SchemaCatalog). A deployment without it loses that backstop under concurrent
writers; worth applying the constraint in production schema.

Tests

  • MergingSweepPolicyTestdecide() emits MergeInto for a Duplicate mark;
    falls back to STALE when pinned/unmarked/blank-or-self survivor.
  • DefaultCollectorRunnerTest — live merge folds grounding/provenance/sourceIds
    onto the survivor (deduped) with reinforceCount + 1, then retires the loser;
    dry run mutates nothing; missing survivor falls back cleanly; multi-loser
    cluster absorbs every loser's evidence; withDuplicateDetection defers to an
    explicit policy.
  • CollectorDedupMergesGroundingTest (integration) — collapses a real cluster
    and asserts the survivor absorbed the loser's evidence while the loser is STALE.

jimador added 6 commits July 1, 2026 01:21
#56)

The dream-loop dedup sweep collapsed near-duplicate propositions by flipping
the losers to STALE with a pure status change. Because STALE is excluded from
retrieval, every loser's grounding and provenance became invisible the moment
it was retired — real evidence silently lost on collapse.

Fold the loser's evidence onto the survivor before retiring it. The policy
decides intent, the runner does the IO (it is the only layer with a repository
handle), preserving the existing SPI split:

- SweepAction.MergeInto(survivorId, thenStatus): new outcome for a collapse.
- MergingSweepPolicy: reads MarkReason.Duplicate.survivorId and returns
  MergeInto; falls back to a plain STALE transition when pinned, unmarked, or
  when no usable survivor id is present. StatusTransitionSweepPolicy is
  untouched.
- DefaultCollectorRunner handles MergeInto: loads the survivor, folds the
  loser's grounding + provenance + sourceIds onto it (append + distinct, reusing
  the existing Proposition helpers), bumps reinforceCount by 1 to match
  LlmPropositionReviser's per-merge convention, saves the survivor, then
  transitions the loser and emits the same status-changed event the plain
  transition path emits. A missing survivor falls back to a plain retirement
  rather than throwing. Dry runs mutate nothing.

Wiring: the dedup sweep opts into merge semantics via a dedicated builder entry
point, CollectorRunner.Builder.withDuplicateDetection(), which pairs a
DuplicateCollectorStrategy with MergingSweepPolicy. The global builder default
stays StatusTransitionSweepPolicy so the decay/stale path is unchanged.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
- MergingSweepPolicyTest: decide() emits MergeInto for a Duplicate mark and
  falls back to STALE when pinned, unmarked, blank/self survivor id, or a
  non-duplicate mark; custom target status honored.
- DefaultCollectorRunnerTest: live MergeInto folds the loser's grounding /
  provenance / sourceIds onto the survivor (deduped) with reinforceCount +1 then
  retires the loser to STALE; dry-run mutates nothing; a missing survivor falls
  back to a plain retirement; a multi-loser cluster has ALL losers' evidence
  absorbed (transitivity).
- CollectorDedupMergesGroundingTest (integration): collapses a real cluster
  through InMemoryPropositionRepository + the withDuplicateDetection() wiring and
  asserts the survivor absorbed the loser's evidence while the loser is STALE.
- StatusTransitionSweepPolicyTest: extend the sealed-family exhaustiveness check
  with the new MergeInto case.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…sing merge survivor (#56)

Review follow-ups on the dedup-collapse merge:

- withDuplicateDetection() no longer clobbers a policy the caller set
  explicitly via withPolicy(). It now installs MergingSweepPolicy only when the
  policy is still the untouched builder default, so a caller-chosen policy wins
  regardless of call order. Adds a precedence test and a KDoc note, plus a
  warning that the strategy paired here must never mark the survivor.
- The missing-survivor MergeInto fallback now logs at WARN instead of DEBUG:
  its audit record looks like an ordinary transition, so the log is the only
  signal that a merge dropped its evidence fold.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…ing (#56)

An in-place save of an already-stored proposition — a reinforce, a status
change, or the dedup sweep folding a loser's grounding onto the survivor — is
an update by id. DrivinePropositionRepository.save routes every non-blank-text
save through findDuplicateId, which matches on (contextId, text) ignoring id and
status; if a foreign same-text sibling exists it returned that sibling and never
persisted the incoming object, silently dropping the update (and, for a retire,
emitting an event describing the wrong node).

The (contextId, text) uniqueness constraint normally makes a foreign same-text
sibling impossible, so this is unreachable under the documented schema. But the
schema is adopter-supplied, and save's contract — an update writes to its own
node — must hold regardless. Guard the dedup-collapse so it only fires for a
genuinely NEW insert (id not yet stored); an existing id always updates its own
node via doPersist. The existsById probe runs only on the rare collision path
(existingId != null), so genuine inserts pay nothing.

Test drops the (contextId, text) constraint, seeds a same-text sibling, and
proves the survivor keeps its merged grounding and the retire lands on the right
node; it fails without the guard. Insert-dedup behaviour is unchanged.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
@jimador jimador marked this pull request as ready for review July 1, 2026 06:02
jimador added 2 commits July 1, 2026 09:03
… score (#56)

Neo4j's cosine vector index scores as (1 + cosine) / 2, and findClusters passed
that straight through as SimilarityResult.score. The in-memory store returns raw
cosine, so the two diverged: a cosine-0.5 pair read as 0.75. Consumers that treat
the score as cosine — the multi-signal dedup scorer — over-merged distinct
same-subject propositions, while the unit tests stayed green on the in-memory store.

Convert back to raw cosine for both the threshold gate and the returned score.
Regression test seeds two vectors with an exact cosine of 0.8 and asserts the
reported score is 0.8, not 0.9.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Bind `2.0 * score - 1.0 AS cosine` once in the query instead of repeating the
expression in both the threshold gate and the RETURN, and note that the
conversion is COSINE-index-specific. Behaviour unchanged; tidies comments.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dedup sweep drops grounding/provenance when collapsing duplicates (pure STALE flip, no merge onto survivor)

1 participant