feat(bm25): BM25 ranked text search on standard posting lists#9730
Open
shaunpatterson wants to merge 21 commits into
Open
feat(bm25): BM25 ranked text search on standard posting lists#9730shaunpatterson wants to merge 21 commits into
shaunpatterson wants to merge 21 commits into
Conversation
Add BM25 relevance-ranked text search to Dgraph, enabling users to query text predicates and receive results ordered by relevance score instead of boolean matching. Implementation: - New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem) that preserves term frequencies for TF counting - BM25-specific index storage: per-term TF posting lists, doc length lists, and corpus statistics (doc count, total terms) - Query execution with full BM25 scoring: score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf) IDF = log1p((N - df + 0.5) / (df + 0.5)) - DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter - Schema syntax: @index(bm25) - Parameter validation (k > 0, 0 <= b <= 1) - Early UID intersection for filter-mode performance - All-stopword document and query handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three critical bugs fixed: 1. REF postings lose Value during rollup: The posting list encode/rollup cycle strips the Value field from REF postings without facets (list.go:1630). BM25 term frequencies and doc lengths were stored in Value and lost. Fix: Store TF and doclen as facets on REF postings, which are preserved. 2. Missing function validation: query/query.go has a separate isValidFuncName check from dql/parser.go. "bm25" was only added to the parser, causing "Invalid function name: bm25" at query time. 3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score, but the query pipeline (algo.MergeSorted, child predicate fetching) requires UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply first/offset pagination on score-sorted results before UID sorting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact varint-encoded binary blobs stored as direct Badger KV entries (~4-6 bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for variable-based score ordering following the similar_to pattern. - Add posting/bm25enc package for compact binary encode/decode - Rewrite write path in posting/index.go for direct Badger KV - Add bm25Writes buffer to LocalCache with read-your-own-writes - Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta - Rewrite read path in worker/task.go with direct blob decoding - Add bm25_score pseudo-predicate in query/query.go - Add score ordering integration tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases Cover incremental add/update/delete, IDF score stability as corpus grows, large corpus pagination, unicode, stopwords, uid filtering, score validation, and concurrent batch adds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…c tests Addresses test coverage gaps identified during code review against ArangoDB's BM25 implementation: - TestBM25ExactScoreValues: validates numerical correctness of BM25 formula using b=0 to enable hand-computed expected scores - TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization and contrasts with default b=0.75 behavior - TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 1 of BM25 scaling plan. Introduces bm25block package with: - BlockMeta/Dir types for block directory encoding/decoding - SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks - MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal - ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation - New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey, BM25DocLenBlockKey for block-addressed Badger KV storage 17 unit tests and benchmarks for the block storage format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phases 2-4 of BM25 scaling plan: Phase 2 - Segmented mutation path: - addBM25IndexMutations now writes to block-based storage - Each term's postings split into ~128-entry blocks with a directory - Blocks automatically split when exceeding 256 entries - Doc-length list also uses block-based storage - Block removal and directory cleanup on deletes Phase 3 - WAND top-k query path: - New bm25wand.go with listIter for block-based posting list iteration - WAND algorithm with min-heap for top-k early termination - Per-block upper bounds (UBPre) computed from maxTF at query time - Suffix-max UBPre for efficient threshold checking - Falls back to scoring all docs when no first: limit or offset is used Phase 4 - Block-Max WAND: - skipToWithBMW skips entire blocks whose UB + other terms can't beat theta - Avoids Badger reads for blocks that can't contribute to top-k - Enabled by default in handleBM25Search Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 5 - Migration support: - newListIter falls back to legacy monolithic blob when no block directory exists - lookupDocLen falls back to legacy BM25DocLenKey blob - wandSearch falls back to legacy BM25IndexKey for df computation - Legacy data transparently served through synthetic single-block directory - New writes always use block format; old data works until overwritten Unit tests for WAND components: - TestTopKHeapBasic: heap operations, threshold, eviction - TestTopKHeapTieBreaking: deterministic ordering on score ties - TestBm25ScoreFunction: formula verification, tf/dl/b edge cases - TestBm25ScoreNaN: no NaN/Inf for edge-case inputs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fixes critical bugs and performance issues identified by GPT-5 review: - Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against inBlockPos < 0 (possible before first next() call) - Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now skip empty blocks instead of leaving iterator in invalid state with MaxUint64 pivotDoc - Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer backwards, could cause re-scoring or infinite loops) - Fix remainingUB panic: guard against blockIdx < 0 (before first next()) - Add docLenCache: caches doclen directory + block reads within a single query, avoiding repeated Badger reads per scored document - Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of iterating all other terms (O(q^2) -> O(q)) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…UB underestimate Three fixes: 1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen blocks before incrementing stats, preventing double-counting on SET when the document was already indexed (defensive guard for batch mutations). 2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot), so BMW's otherUB calculation is correct and won't skip valid candidate blocks. 3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads (directory was read once for df, then again inside newListIter). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add DecodeCount() to bm25enc for O(1) entry count reads without full decode, preventing OOM on legacy migration with large posting lists (e.g., common terms with millions of entries) - Use DecodeCount in WAND search legacy DF calculation path - Fix integer overflow in DecodeDir bounds check by using uint64 arithmetic (prevents panic on corrupted data with MaxUint32 count) - Pre-allocate shared score buffer in handleBM25Search with three-index slices to prevent accidental append corruption - Document bm25Writes concurrency model and limitations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the parallel block-storage + retrieval stack (declined in review) with an implementation that rides Dgraph's standard posting-list machinery, addressing the maintainer's feedback (independently endorsed by GPT-5 and Gemini). Net ~1300 fewer lines. Storage / indexing: - BM25 term postings are standard index posting lists at IndexKey(attr, IdentBM25||term), written via the normal delta path, so they inherit MVCC, deltas, rollup, splits, backup and snapshot. Each posting is a REF posting whose value packs (term-frequency, doc-length) as two uvarints. - Fix the linchpin: List.encode() now retains REF postings that carry a value through rollup (otherwise the term frequency was silently stripped). Mirrors how faceted postings already coexist in Pack (uid) + Postings (payload). - Document length is packed into the posting value rather than a separate list, avoiding a write-hot doclen key and a per-candidate random read at query time. - Corpus stats (docCount, totalTerms) are sharded across 32 buckets keyed by uid%32 so concurrent writers rarely contend, while same-bucket updates still conflict-and-retry (mirrors the @count pattern). Term postings get the standard index conflict key (fingerprint(key)^uid), so two docs sharing a term commit concurrently without conflict -- resolving the concurrency regression that the block version only mitigated via Raft serialization. - Delete posting/bm25block and posting/bm25enc; remove LocalCache.bm25Writes, BitBM25Data, and the BM25 commit branch in mvcc.go. Query / scoring: - WAND / Block-Max WAND reworked over the standard posting-list iterator: per-term cursors are materialized from the in-memory List with per-128-posting maxTF/minDocLen bounds -- no parallel block format, no proto changes. - Surface the score via Dgraph's existing value-variable mechanism: the bm25 root function binds its per-doc score to its own variable (Uids + Vals), so `scores as var(func: bm25(...))` works with uid(scores), val(scores) and orderdesc: val(scores). Removes the bm25_score pseudo-predicate and the __bm25_scores__ ParentVars channel. - Skip the query-layer pagination pass for bm25 roots (the worker already paginates over score order), mirroring the existing `has` handling, to avoid double-applying first/offset. Tests: - Rollup TF/doc-length survival, bucketed-stats accumulation (incl. same-bucket in-transaction read-your-own-writes and deletes), value-codec round trip, and a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check. - Convert the query integration tests to the value-variable syntax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
updateBM25Stats maintains the bucketed (docCount, totalTerms) counters via read-modify-write, but read them with LocalCache.GetFromDelta, which skips disk and returns only the current transaction's in-memory delta. Across separately committed mutations each transaction therefore started from zero and overwrote its stats bucket instead of accumulating, collapsing the corpus document count (e.g. to the per-term df). Since avgDL = totalTerms/docCount and idf depends on N = docCount, every length- and idf-weighted BM25 score was wrong, while result ordering (a constant idf factor for a single-term query) still looked correct. Read the committed stats with LocalCache.Get instead. Term postings are unaffected: they are additive deltas that merge through the normal posting-list mechanism and never read-modify-write. Found by the BM25 integration tests (exact-score and uid-filter cases). The prior unit test only exercised a single transaction, where read-your-own-writes masked the bug; add TestBM25StatsAccumulateAcrossTxns covering the multi- transaction path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BM25 scores a single document (one value) per UID, so per-document length and corpus statistics are ill-defined for a list predicate. The bucketed stats also rely on conflict detection that a list predicate's value-dependent conflict key would not provide (a code-review concern about stats integrity on list/ @noconflict predicates). Reject the combination in checkSchema rather than silently mis-scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BM25 term postings live under the IdentBM25 token prefix, but the corpus
statistics buckets live under a separate reserved token prefix
("\x00_bm25stats_"). prefixesToDeleteTokensFor only emitted the
tokenizer-identifier prefix, so dropping or rebuilding a BM25 index left the
stats keys orphaned. On rebuild the fresh stats then accumulated on top of the
stale ones, double-counting docCount/totalTerms and corrupting avgDL/IDF.
Add x.BM25StatsPrefix and include it (plus its ByteSplit variant) in the bm25
deletion prefixes so the stats are removed together with the term postings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The streaming index rebuild processes a predicate's documents across ~16 goroutines, each with its own transaction/cache that is periodically reset (NewTxn) every ~10k posting lists, emitting deltas to a temp store. BM25 corpus statistics were maintained by a per-transaction read-modify-write of a single value posting per bucket, which collapses under that model: every thread (and every post-reset segment) reads an empty base and writes its own partial total, and the value postings merge last-write-wins — so the rebuilt docCount/totalTerms were severely undercounted, distorting avgDL and IDF. Even a first-time @index(bm25) over existing data produced wrong scores. Route stats through a shared, concurrency-safe accumulator (Txn.bm25Acc, set on the rebuild's per-thread txns) instead of the RMW counter, then flush all 32 buckets once as a single writer. The flush is written into the temp store as ordinary delta postings so the rebuild's second phase rolls it up into pstore at startTs alongside the term postings — no separate commit path. The live mutation path is unchanged and was already correct: on a scalar predicate fingerprintEdge returns the MaxUint64 sentinel for every untagged value, so concurrent same-bucket writers share a conflict key and one retries (this is why @index(bm25) is rejected on list predicates). Added a regression test for that invariant, a unit test for the rebuild accumulator, and integration tests for rebuild-on-existing-data, drop-then-re-add (no stale-stats double counting), and concurrent overlapping mutations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
handleBM25Search only engaged WAND top-k early termination when first was set WITHOUT an offset. Any query with an offset (or a bm25 filter) fell through to scoreAllDocs, which materializes every matching document into a map — and wandSearch already reads each query term's entire posting list into memory. A deep-paginated query over a hot term could therefore allocate and score the whole corpus to return a handful of rows. Engage WAND whenever a first limit is present, retaining first+offset documents and dropping the offset afterward (bm25TopK / bm25PaginateScored, extracted as pure, unit-tested helpers). With no first limit, every match is still scored — the caller explicitly asked for all results. wandSearch always returns score-descending results, so a single pagination path is correct for both the top-k and score-all cases. Also adds TestWandFilteredMatchesBruteForce: the existing WAND brute-force comparison only exercised a nil filter, leaving @filter(bm25(...)) pruning unverified. The new test fuzzes random filter subsets across WAND, Block-Max WAND, and the score-all path against exhaustive filtered scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bulk loader ran BM25 through the generic tokenizer path, which produced a silently broken, unsearchable index: term postings were written as bare REF postings with no value (addMapEntry drops the payload of any REF posting without facets), so the packed (term frequency, document length) was lost and ReadBM25TermPostings skipped every posting; and no corpus statistics were written at all, so docCount was 0 and every bm25() query short-circuited to empty. - addMapEntry now retains REF postings that carry a Value (mirroring the len(p.Value) > 0 clause in List.encode), so BM25 term postings keep their tf/dl. - addIndexMapEntries special-cases the BM25 tokenizer: one posting per distinct term with the packed value, and per-document corpus-stat partials accumulated per mapper. - After the map phase, loader.flushBM25Stats merges every mapper's partials and writes one value posting per (predicate, bucket) into the predicate's map shard, so the reduce emits a single stats posting per bucket in the same format the live and rebuild paths use. Stats are summed across mappers (not unioned like postings), which is the correctness crux, covered by TestMergeBM25Stats. Adds an end-to-end bulk -> bm25() query integration test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Supersedes #9652 — reworked per the review feedback there.
What
Adds BM25 ranked full-text search: a new
bm25index tokenizer plus abm25(predicate, "query", [k], [b])query function.How — addressing the #9652 review
The previous version was declined for carrying a parallel storage + retrieval stack alongside the existing one. This rework lands BM25 entirely on Dgraph's standard machinery:
IndexKey(attr, IdentBM25‖term), written throughaddMutation— inheriting MVCC, deltas, rollup, splits, backup, and snapshot. Each posting is a REF posting whose value packs(term-frequency, doc-length).List.encode()now retains REF postings that carry a value through rollup (previously the value was stripped). This mirrors how faceted postings already coexist in bothPack(uid) andPostings(payload). Covered by a forced-rollup regression test.bm25block/bm25enc, theLocalCachesecond write path, theBitBM25Datauser-meta, and the dedicatedmvcc.gocommit branch are gone. MVCC does the delta merging.fingerprint(key)^uid), removing the race the previous version mitigated only via Raft serialization.uid % 32to avoid a single write-hot key, and summed at read time.bm25function binds its per-document score to a value variable, so it works through the existingval()system. Nobm25_scorepseudo-predicate and no newParentVarschannel.One deviation worth your call
The review suggested storing document length "next to the predicate's existing length/count machinery." I instead packed
docLeninto the posting value alongside TF: a single per-predicate doclen list is a write-conflict hotspot and forces a random read per scored candidate, whereas packing lets scoring read(uid, tf, docLen)in one access. Happy to switch to a dedicated list if you prefer the literal approach.Usage
{ results as var(func: bm25(description, "machine learning")) q(func: uid(results), orderdesc: val(results), first: 10) { description score: val(results) } }Optional BM25 parameters:
bm25(description, "...", "1.2", "0.75")(k,b).@index(bm25)is rejected on list predicates (a single document length / corpus stat is not well-defined per list element).Testing
queryBM25 integration tests pass against a multi-node cluster; the fullqueryintegration suite passes (no regression from the query-layer changes).🤖 Generated with Claude Code