fix(sweep): merge grounding onto survivor on dedup collapse (#56)#57
Open
jimador wants to merge 8 commits into
Open
fix(sweep): merge grounding onto survivor on dedup collapse (#56)#57jimador wants to merge 8 commits into
jimador wants to merge 8 commits into
Conversation
#56) The dream-loop dedup sweep collapsed near-duplicate propositions by flipping the losers to STALE with a pure status change. Because STALE is excluded from retrieval, every loser's grounding and provenance became invisible the moment it was retired — real evidence silently lost on collapse. Fold the loser's evidence onto the survivor before retiring it. The policy decides intent, the runner does the IO (it is the only layer with a repository handle), preserving the existing SPI split: - SweepAction.MergeInto(survivorId, thenStatus): new outcome for a collapse. - MergingSweepPolicy: reads MarkReason.Duplicate.survivorId and returns MergeInto; falls back to a plain STALE transition when pinned, unmarked, or when no usable survivor id is present. StatusTransitionSweepPolicy is untouched. - DefaultCollectorRunner handles MergeInto: loads the survivor, folds the loser's grounding + provenance + sourceIds onto it (append + distinct, reusing the existing Proposition helpers), bumps reinforceCount by 1 to match LlmPropositionReviser's per-merge convention, saves the survivor, then transitions the loser and emits the same status-changed event the plain transition path emits. A missing survivor falls back to a plain retirement rather than throwing. Dry runs mutate nothing. Wiring: the dedup sweep opts into merge semantics via a dedicated builder entry point, CollectorRunner.Builder.withDuplicateDetection(), which pairs a DuplicateCollectorStrategy with MergingSweepPolicy. The global builder default stays StatusTransitionSweepPolicy so the decay/stale path is unchanged. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
- MergingSweepPolicyTest: decide() emits MergeInto for a Duplicate mark and falls back to STALE when pinned, unmarked, blank/self survivor id, or a non-duplicate mark; custom target status honored. - DefaultCollectorRunnerTest: live MergeInto folds the loser's grounding / provenance / sourceIds onto the survivor (deduped) with reinforceCount +1 then retires the loser to STALE; dry-run mutates nothing; a missing survivor falls back to a plain retirement; a multi-loser cluster has ALL losers' evidence absorbed (transitivity). - CollectorDedupMergesGroundingTest (integration): collapses a real cluster through InMemoryPropositionRepository + the withDuplicateDetection() wiring and asserts the survivor absorbed the loser's evidence while the loser is STALE. - StatusTransitionSweepPolicyTest: extend the sealed-family exhaustiveness check with the new MergeInto case. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…sing merge survivor (#56) Review follow-ups on the dedup-collapse merge: - withDuplicateDetection() no longer clobbers a policy the caller set explicitly via withPolicy(). It now installs MergingSweepPolicy only when the policy is still the untouched builder default, so a caller-chosen policy wins regardless of call order. Adds a precedence test and a KDoc note, plus a warning that the strategy paired here must never mark the survivor. - The missing-survivor MergeInto fallback now logs at WARN instead of DEBUG: its audit record looks like an ordinary transition, so the log is the only signal that a merge dropped its evidence fold. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…ing (#56) An in-place save of an already-stored proposition — a reinforce, a status change, or the dedup sweep folding a loser's grounding onto the survivor — is an update by id. DrivinePropositionRepository.save routes every non-blank-text save through findDuplicateId, which matches on (contextId, text) ignoring id and status; if a foreign same-text sibling exists it returned that sibling and never persisted the incoming object, silently dropping the update (and, for a retire, emitting an event describing the wrong node). The (contextId, text) uniqueness constraint normally makes a foreign same-text sibling impossible, so this is unreachable under the documented schema. But the schema is adopter-supplied, and save's contract — an update writes to its own node — must hold regardless. Guard the dedup-collapse so it only fires for a genuinely NEW insert (id not yet stored); an existing id always updates its own node via doPersist. The existsById probe runs only on the rare collision path (existingId != null), so genuine inserts pay nothing. Test drops the (contextId, text) constraint, seeds a same-text sibling, and proves the survivor keeps its merged grounding and the retire lands on the right node; it fails without the guard. Insert-dedup behaviour is unchanged. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
… score (#56) Neo4j's cosine vector index scores as (1 + cosine) / 2, and findClusters passed that straight through as SimilarityResult.score. The in-memory store returns raw cosine, so the two diverged: a cosine-0.5 pair read as 0.75. Consumers that treat the score as cosine — the multi-signal dedup scorer — over-merged distinct same-subject propositions, while the unit tests stayed green on the in-memory store. Convert back to raw cosine for both the threshold gate and the returned score. Regression test seeds two vectors with an exact cosine of 0.8 and asserts the reported score is 0.8, not 0.9. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Bind `2.0 * score - 1.0 AS cosine` once in the query instead of repeating the expression in both the threshold gate and the RETURN, and note that the conversion is COSINE-index-specific. Behaviour unchanged; tidies comments. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #56.
The dream-loop dedup sweep collapsed near-duplicate propositions by flipping the
losers to STALE with a pure status change. STALE is excluded from retrieval, so
each loser's grounding and provenance became invisible the moment it was retired
— evidence silently lost on every collapse. The ingestion-time reviser already
folds grounding onto the survivor; the sweep was the gap.
What changed
Fold the loser's evidence onto the survivor before retiring it. Policy decides
intent, the runner does the IO (it is the only layer with a repository handle),
keeping the existing SPI split:
SweepAction.MergeInto(survivorId, thenStatus)— new collapse outcome.MergingSweepPolicy— readsMarkReason.Duplicate.survivorId, returnsMergeInto; falls back to a plain STALE transition when pinned, unmarked, orwhen no usable survivor id is present.
StatusTransitionSweepPolicyisuntouched.
DefaultCollectorRunnerhandlesMergeInto: loads the survivor, folds theloser's grounding + provenance + sourceIds onto it (append + distinct via the
existing
Propositionhelpers), bumpsreinforceCountby 1 to match thereviser's per-merge convention, saves the survivor, then retires the loser and
emits the same status-changed event the plain path emits. A missing survivor
falls back to a plain retirement (logged at WARN, since the audit row looks
like an ordinary transition). Dry runs mutate nothing.
CollectorRunner.Builder.withDuplicateDetection(), which pairs aDuplicateCollectorStrategywithMergingSweepPolicy. It defers to a policyset explicitly via
withPolicy(), and the global builder default staysStatusTransitionSweepPolicy, so the decay/stale path is unchanged.Review notes
Adversarial review raised the production
save()path:findOrPersistdedupsby
(contextId, text), so an in-place update could in theory be hijacked by asame-text sibling. It's unreachable in practice —
save()dedups identical textat creation, so two same-text nodes can't coexist, and near-duplicate collapse
operates on different-text propositions whose exact text never collides. No
hot-path guard was added.
Caveat for adopters:
save()'s dedup contract assumes a(contextId, text)uniqueness constraint as a cross-instance backstop (declared in the test
SchemaCatalog). A deployment without it loses that backstop under concurrentwriters; worth applying the constraint in production schema.
Tests
MergingSweepPolicyTest—decide()emitsMergeIntofor a Duplicate mark;falls back to STALE when pinned/unmarked/blank-or-self survivor.
DefaultCollectorRunnerTest— live merge folds grounding/provenance/sourceIdsonto the survivor (deduped) with
reinforceCount + 1, then retires the loser;dry run mutates nothing; missing survivor falls back cleanly; multi-loser
cluster absorbs every loser's evidence;
withDuplicateDetectiondefers to anexplicit policy.
CollectorDedupMergesGroundingTest(integration) — collapses a real clusterand asserts the survivor absorbed the loser's evidence while the loser is STALE.