feat(bm25): BM25 ranked text search on standard posting lists by shaunpatterson · Pull Request #9730 · dgraph-io/dgraph

shaunpatterson · 2026-06-04T01:49:10Z

Supersedes #9652 — reworked per the review feedback there.

What

Adds BM25 ranked full-text search: a new bm25 index tokenizer plus a bm25(predicate, "query", [k], [b]) query function.

How — addressing the #9652 review

The previous version was declined for carrying a parallel storage + retrieval stack alongside the existing one. This rework lands BM25 entirely on Dgraph's standard machinery:

Term postings are normal index posting lists at IndexKey(attr, IdentBM25‖term), written through addMutation — inheriting MVCC, deltas, rollup, splits, backup, and snapshot. Each posting is a REF posting whose value packs (term-frequency, doc-length).
- One small core change makes this work: List.encode() now retains REF postings that carry a value through rollup (previously the value was stripped). This mirrors how faceted postings already coexist in both Pack (uid) and Postings (payload). Covered by a forced-rollup regression test.
No parallel storage stack — bm25block/bm25enc, the LocalCache second write path, the BitBM25Data user-meta, and the dedicated mvcc.go commit branch are gone. MVCC does the delta merging.
Concurrency: two documents sharing a term now commit concurrently without conflict (standard index conflict key fingerprint(key)^uid), removing the race the previous version mitigated only via Raft serialization.
Corpus statistics (N, average document length) are sharded across 32 buckets keyed by uid % 32 to avoid a single write-hot key, and summed at read time.
Scoring is a query-side ranker, not a pseudo-predicate: the bm25 function binds its per-document score to a value variable, so it works through the existing val() system. No bm25_score pseudo-predicate and no new ParentVars channel.
WAND / Block-Max WAND is kept as an optimization on top of the standard posting-list iterator (per-block max-TF / min-docLen bounds computed from the in-memory list) — not a replacement for it.

One deviation worth your call

The review suggested storing document length "next to the predicate's existing length/count machinery." I instead packed docLen into the posting value alongside TF: a single per-predicate doclen list is a write-conflict hotspot and forces a random read per scored candidate, whereas packing lets scoring read (uid, tf, docLen) in one access. Happy to switch to a dedicated list if you prefer the literal approach.

Usage

description: string @index(bm25) .

{
  results as var(func: bm25(description, "machine learning"))
  q(func: uid(results), orderdesc: val(results), first: 10) {
    description
    score: val(results)
  }
}

Optional BM25 parameters: bm25(description, "...", "1.2", "0.75") (k, b). @index(bm25) is rejected on list predicates (a single document length / corpus stat is not well-defined per list element).

Testing

Unit: rollup TF/doc-length survival; bucketed-stats accumulation (including cross-transaction and same-bucket read-your-own-writes); value-codec round trip; a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check.
Integration: 34 query BM25 integration tests pass against a multi-node cluster; the full query integration suite passes (no regression from the query-layer changes).

🤖 Generated with Claude Code

Add BM25 relevance-ranked text search to Dgraph, enabling users to query text predicates and receive results ordered by relevance score instead of boolean matching. Implementation: - New BM25 tokenizer using the fulltext pipeline (normalize, stopwords, stem) that preserves term frequencies for TF counting - BM25-specific index storage: per-term TF posting lists, doc length lists, and corpus statistics (doc count, total terms) - Query execution with full BM25 scoring: score = IDF * (k+1) * tf / (k * (1 - b + b * dl/avgDL) + tf) IDF = log1p((N - df + 0.5) / (df + 0.5)) - DQL syntax: bm25(predicate, "query" [, "k", "b"]) as root func or filter - Schema syntax: @index(bm25) - Parameter validation (k > 0, 0 <= b <= 1) - Early UID intersection for filter-mode performance - All-stopword document and query handling Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Three critical bugs fixed: 1. REF postings lose Value during rollup: The posting list encode/rollup cycle strips the Value field from REF postings without facets (list.go:1630). BM25 term frequencies and doc lengths were stored in Value and lost. Fix: Store TF and doclen as facets on REF postings, which are preserved. 2. Missing function validation: query/query.go has a separate isValidFuncName check from dql/parser.go. "bm25" was only added to the parser, causing "Invalid function name: bm25" at query time. 3. Unsorted UIDs break query pipeline: BM25 returned UIDs sorted by score, but the query pipeline (algo.MergeSorted, child predicate fetching) requires UID-ascending order. Fix: Sort UIDs ascending in UidMatrix, apply first/offset pagination on score-sorted results before UID sorting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the facet-based BM25 storage (~40-50 bytes/posting) with compact varint-encoded binary blobs stored as direct Badger KV entries (~4-6 bytes/posting, ~10x reduction). Add bm25_score pseudo-predicate for variable-based score ordering following the similar_to pattern. - Add posting/bm25enc package for compact binary encode/decode - Rewrite write path in posting/index.go for direct Badger KV - Add bm25Writes buffer to LocalCache with read-your-own-writes - Flush BM25 blobs in CommitToDisk with BitBM25Data UserMeta - Rewrite read path in worker/task.go with direct blob decoding - Add bm25_score pseudo-predicate in query/query.go - Add score ordering integration tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…cases Cover incremental add/update/delete, IDF score stability as corpus grows, large corpus pagination, unicode, stopwords, uid filtering, score validation, and concurrent batch adds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…c tests Addresses test coverage gaps identified during code review against ArangoDB's BM25 implementation: - TestBM25ExactScoreValues: validates numerical correctness of BM25 formula using b=0 to enable hand-computed expected scores - TestBM25BM15NoLengthNormalization: verifies b=0 disables length normalization and contrasts with default b=0.75 behavior - TestBM25SingleMatchingDocument: covers df=1 edge case with high IDF Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 1 of BM25 scaling plan. Introduces bm25block package with: - BlockMeta/Dir types for block directory encoding/decoding - SplitIntoBlocks: splits monolithic entry slices into 128-entry blocks - MergeAllBlocks: compacts overlapping blocks with dedup and tombstone removal - ComputeUBPre/SuffixMaxUBPre: WAND upper-bound precomputation - New key functions: BM25TermDirKey, BM25TermBlockKey, BM25DocLenDirKey, BM25DocLenBlockKey for block-addressed Badger KV storage 17 unit tests and benchmarks for the block storage format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phases 2-4 of BM25 scaling plan: Phase 2 - Segmented mutation path: - addBM25IndexMutations now writes to block-based storage - Each term's postings split into ~128-entry blocks with a directory - Blocks automatically split when exceeding 256 entries - Doc-length list also uses block-based storage - Block removal and directory cleanup on deletes Phase 3 - WAND top-k query path: - New bm25wand.go with listIter for block-based posting list iteration - WAND algorithm with min-heap for top-k early termination - Per-block upper bounds (UBPre) computed from maxTF at query time - Suffix-max UBPre for efficient threshold checking - Falls back to scoring all docs when no first: limit or offset is used Phase 4 - Block-Max WAND: - skipToWithBMW skips entire blocks whose UB + other terms can't beat theta - Avoids Badger reads for blocks that can't contribute to top-k - Enabled by default in handleBM25Search Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 5 - Migration support: - newListIter falls back to legacy monolithic blob when no block directory exists - lookupDocLen falls back to legacy BM25DocLenKey blob - wandSearch falls back to legacy BM25IndexKey for df computation - Legacy data transparently served through synthetic single-block directory - New writes always use block format; old data works until overwritten Unit tests for WAND components: - TestTopKHeapBasic: heap operations, threshold, eviction - TestTopKHeapTieBreaking: deterministic ordering on score ties - TestBm25ScoreFunction: formula verification, tf/dl/b edge cases - TestBm25ScoreNaN: no NaN/Inf for edge-case inputs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixes critical bugs and performance issues identified by GPT-5 review: - Fix negative inBlockPos panic: guard currentDoc/currentTF/skipTo against inBlockPos < 0 (possible before first next() call) - Fix empty block pathological behavior: next()/skipTo()/skipToWithBMW() now skip empty blocks instead of leaving iterator in invalid state with MaxUint64 pivotDoc - Fix legacy loadBlock: no longer resets inBlockPos to 0 (was moving pointer backwards, could cause re-scoring or infinite loops) - Fix remainingUB panic: guard against blockIdx < 0 (before first next()) - Add docLenCache: caches doclen directory + block reads within a single query, avoiding repeated Badger reads per scored document - Optimize BMW otherUB: compute as sumUB - thisUB (O(1)) instead of iterating all other terms (O(q^2) -> O(q)) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…UB underestimate Three fixes: 1. CRITICAL: addBM25IndexMutations now checks if a UID already exists in doclen blocks before incrementing stats, preventing double-counting on SET when the document was already indexed (defensive guard for batch mutations). 2. HIGH: WAND sumUB now accumulates across ALL iterators (not just up to pivot), so BMW's otherUB calculation is correct and won't skip valid candidate blocks. 3. PERF: newListIter accepts pre-read Dir to eliminate duplicate Badger reads (directory was read once for df, then again inside newListIter). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add DecodeCount() to bm25enc for O(1) entry count reads without full decode, preventing OOM on legacy migration with large posting lists (e.g., common terms with millions of entries) - Use DecodeCount in WAND search legacy DF calculation path - Fix integer overflow in DecodeDir bounds check by using uint64 arithmetic (prevents panic on corrupted data with MaxUint32 count) - Pre-allocate shared score buffer in handleBM25Search with three-index slices to prevent accidental append corruption - Document bm25Writes concurrency model and limitations Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces the parallel block-storage + retrieval stack (declined in review) with an implementation that rides Dgraph's standard posting-list machinery, addressing the maintainer's feedback (independently endorsed by GPT-5 and Gemini). Net ~1300 fewer lines. Storage / indexing: - BM25 term postings are standard index posting lists at IndexKey(attr, IdentBM25||term), written via the normal delta path, so they inherit MVCC, deltas, rollup, splits, backup and snapshot. Each posting is a REF posting whose value packs (term-frequency, doc-length) as two uvarints. - Fix the linchpin: List.encode() now retains REF postings that carry a value through rollup (otherwise the term frequency was silently stripped). Mirrors how faceted postings already coexist in Pack (uid) + Postings (payload). - Document length is packed into the posting value rather than a separate list, avoiding a write-hot doclen key and a per-candidate random read at query time. - Corpus stats (docCount, totalTerms) are sharded across 32 buckets keyed by uid%32 so concurrent writers rarely contend, while same-bucket updates still conflict-and-retry (mirrors the @count pattern). Term postings get the standard index conflict key (fingerprint(key)^uid), so two docs sharing a term commit concurrently without conflict -- resolving the concurrency regression that the block version only mitigated via Raft serialization. - Delete posting/bm25block and posting/bm25enc; remove LocalCache.bm25Writes, BitBM25Data, and the BM25 commit branch in mvcc.go. Query / scoring: - WAND / Block-Max WAND reworked over the standard posting-list iterator: per-term cursors are materialized from the in-memory List with per-128-posting maxTF/minDocLen bounds -- no parallel block format, no proto changes. - Surface the score via Dgraph's existing value-variable mechanism: the bm25 root function binds its per-doc score to its own variable (Uids + Vals), so `scores as var(func: bm25(...))` works with uid(scores), val(scores) and orderdesc: val(scores). Removes the bm25_score pseudo-predicate and the __bm25_scores__ ParentVars channel. - Skip the query-layer pagination pass for bm25 roots (the worker already paginates over score order), mirroring the existing `has` handling, to avoid double-applying first/offset. Tests: - Rollup TF/doc-length survival, bucketed-stats accumulation (incl. same-bucket in-transaction read-your-own-writes and deletes), value-codec round trip, and a 200-trial randomized WAND/Block-Max-WAND vs brute-force correctness check. - Convert the query integration tests to the value-variable syntax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

updateBM25Stats maintains the bucketed (docCount, totalTerms) counters via read-modify-write, but read them with LocalCache.GetFromDelta, which skips disk and returns only the current transaction's in-memory delta. Across separately committed mutations each transaction therefore started from zero and overwrote its stats bucket instead of accumulating, collapsing the corpus document count (e.g. to the per-term df). Since avgDL = totalTerms/docCount and idf depends on N = docCount, every length- and idf-weighted BM25 score was wrong, while result ordering (a constant idf factor for a single-term query) still looked correct. Read the committed stats with LocalCache.Get instead. Term postings are unaffected: they are additive deltas that merge through the normal posting-list mechanism and never read-modify-write. Found by the BM25 integration tests (exact-score and uid-filter cases). The prior unit test only exercised a single transaction, where read-your-own-writes masked the bug; add TestBM25StatsAccumulateAcrossTxns covering the multi- transaction path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BM25 scores a single document (one value) per UID, so per-document length and corpus statistics are ill-defined for a list predicate. The bucketed stats also rely on conflict detection that a list predicate's value-dependent conflict key would not provide (a code-review concern about stats integrity on list/ @noconflict predicates). Reject the combination in checkSchema rather than silently mis-scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BM25 term postings live under the IdentBM25 token prefix, but the corpus statistics buckets live under a separate reserved token prefix ("\x00_bm25stats_"). prefixesToDeleteTokensFor only emitted the tokenizer-identifier prefix, so dropping or rebuilding a BM25 index left the stats keys orphaned. On rebuild the fresh stats then accumulated on top of the stale ones, double-counting docCount/totalTerms and corrupting avgDL/IDF. Add x.BM25StatsPrefix and include it (plus its ByteSplit variant) in the bm25 deletion prefixes so the stats are removed together with the term postings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The streaming index rebuild processes a predicate's documents across ~16 goroutines, each with its own transaction/cache that is periodically reset (NewTxn) every ~10k posting lists, emitting deltas to a temp store. BM25 corpus statistics were maintained by a per-transaction read-modify-write of a single value posting per bucket, which collapses under that model: every thread (and every post-reset segment) reads an empty base and writes its own partial total, and the value postings merge last-write-wins — so the rebuilt docCount/totalTerms were severely undercounted, distorting avgDL and IDF. Even a first-time @index(bm25) over existing data produced wrong scores. Route stats through a shared, concurrency-safe accumulator (Txn.bm25Acc, set on the rebuild's per-thread txns) instead of the RMW counter, then flush all 32 buckets once as a single writer. The flush is written into the temp store as ordinary delta postings so the rebuild's second phase rolls it up into pstore at startTs alongside the term postings — no separate commit path. The live mutation path is unchanged and was already correct: on a scalar predicate fingerprintEdge returns the MaxUint64 sentinel for every untagged value, so concurrent same-bucket writers share a conflict key and one retries (this is why @index(bm25) is rejected on list predicates). Added a regression test for that invariant, a unit test for the rebuild accumulator, and integration tests for rebuild-on-existing-data, drop-then-re-add (no stale-stats double counting), and concurrent overlapping mutations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@filter

handleBM25Search only engaged WAND top-k early termination when first was set WITHOUT an offset. Any query with an offset (or a bm25 filter) fell through to scoreAllDocs, which materializes every matching document into a map — and wandSearch already reads each query term's entire posting list into memory. A deep-paginated query over a hot term could therefore allocate and score the whole corpus to return a handful of rows. Engage WAND whenever a first limit is present, retaining first+offset documents and dropping the offset afterward (bm25TopK / bm25PaginateScored, extracted as pure, unit-tested helpers). With no first limit, every match is still scored — the caller explicitly asked for all results. wandSearch always returns score-descending results, so a single pagination path is correct for both the top-k and score-all cases. Also adds TestWandFilteredMatchesBruteForce: the existing WAND brute-force comparison only exercised a nil filter, leaving @filter(bm25(...)) pruning unverified. The new test fuzzes random filter subsets across WAND, Block-Max WAND, and the score-all path against exhaustive filtered scoring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bulk loader ran BM25 through the generic tokenizer path, which produced a silently broken, unsearchable index: term postings were written as bare REF postings with no value (addMapEntry drops the payload of any REF posting without facets), so the packed (term frequency, document length) was lost and ReadBM25TermPostings skipped every posting; and no corpus statistics were written at all, so docCount was 0 and every bm25() query short-circuited to empty. - addMapEntry now retains REF postings that carry a Value (mirroring the len(p.Value) > 0 clause in List.encode), so BM25 term postings keep their tf/dl. - addIndexMapEntries special-cases the BM25 tokenizer: one posting per distinct term with the packed value, and per-document corpus-stat partials accumulated per mapper. - After the map phase, loader.flushBM25Stats merges every mapper's partials and writes one value posting per (predicate, bucket) into the predicate's map shard, so the reduce emits a single stats posting per bucket in the same format the live and rebuild paths use. Stats are summed across mappers (not unioned like postings), which is the correctness crux, covered by TestMergeBM25Stats. Adds an end-to-end bulk -> bm25() query integration test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shaunpatterson and others added 16 commits June 3, 2026 19:56

fix(bm25): clamp startPos in skipTo to prevent negative sort.Search l…

81b18c1

…ength Defensive hardening from GPT-5 review: if inBlockPos exceeds block length after next() reaches end of block, the sort.Search span could go negative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore(bm25): drop working design notes

7b063f0

The redesign rationale (including the doc-length storage decision) lives in code comments and the PR description; the planning doc does not belong in the tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shaunpatterson requested a review from a team as a code owner June 4, 2026 01:49

shaunpatterson and others added 5 commits June 6, 2026 21:41

fix(bm25): use tagged switch on err (staticcheck QF1002)

ca19759

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(bm25): BM25 ranked text search on standard posting lists#9730

feat(bm25): BM25 ranked text search on standard posting lists#9730
shaunpatterson wants to merge 21 commits into
dgraph-io:mainfrom
shaunpatterson:sp/bm25

shaunpatterson commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Conversation

shaunpatterson commented Jun 4, 2026

What

How — addressing the #9652 review

One deviation worth your call

Usage

Testing

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant