fix: merge_insert silently drops matches when a leading payload column is all-null #7251

Merged: wjones127 merged 2 commits into lance-format:main from Ar-maan05:fix/merge-insert-partial-schema-3515 on Jun 16, 2026


Ar-maan05 commented Jun 12, 2026

Contributor


## Problem

A partial-schema `merge_insert` (`when_matched_update_all`) against a table that
has a scalar index on the join key can silently update **0 rows**, with no error
and no warning, when the first column of the source is all-null. Dropping the
index makes it work again.

Reported as lancedb/lancedb#3515 (and the related lancedb/lancedb#3177).

Minimal repro (from the lancedb issue):

```python
import pyarrow as pa

# `db` (an open LanceDB connection) and `updates` (the 128 source rows) come
# from the original issue and are not shown here.
schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32(), 4), nullable=True),  # all None
    pa.field("path", pa.string(), nullable=False),                 # join key
    pa.field("status", pa.utf8()),
    pa.field("file_size", pa.int64()),
])
tbl = db.create_table("test", schema=schema)
tbl.add(...)                                  # 1000 rows, vector = None
tbl.create_scalar_index("path", index_type="BTREE")  # index on the join key
tbl.merge_insert("path").when_matched_update_all().execute(updates)  # 128 rows
# -> num_updated_rows == 0   (expected 128)
```

## Root cause

A scalar index on the join key routes the merge through the legacy `Merger`
(see `can_use_create_plan`: `would_use_scalar_index` disables the v2 fast path).
The `Merger` reads a full-outer-join stream and, for each row, decides whether
the row came from the source side, the target side, or both, by checking whether
the join **keys** are NULL-padded.
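
To make the NULL-padding idea concrete, here is a minimal pure-Python sketch of a full outer join followed by the source/target/both decision. It is an illustration only, not the `Merger` code: the helper names (`full_outer_join`, `classify`, `NULL_ROW`) are hypothetical, rows are plain dicts, and join keys are assumed to be unique.

```python
COLUMNS = ["path", "status"]
# A row padded with NULLs stands in for the side that had no match.
NULL_ROW = {c: None for c in COLUMNS}


def full_outer_join(source, target, key):
    """Yield (source_row, target_row) pairs; the missing side is NULL_ROW."""
    target_by_key = {row[key]: row for row in target}
    matched = set()
    for s in source:
        t = target_by_key.get(s[key], NULL_ROW)
        if t is not NULL_ROW:
            matched.add(s[key])
        yield s, t
    # Target rows that no source row matched appear with a padded source side.
    for t in target:
        if t[key] not in matched:
            yield NULL_ROW, t


def classify(source_row, target_row, key):
    """Decide where a joined row came from by looking only at the join key."""
    in_source = source_row[key] is not None
    in_target = target_row[key] is not None
    if in_source and in_target:
        return "both"
    return "source_only" if in_source else "target_only"


source = [{"path": "a", "status": "new"}, {"path": "z", "status": "new"}]
target = [{"path": "a", "status": "old"}, {"path": "b", "status": "old"}]

labels = {
    (s["path"] or t["path"]): classify(s, t, "path")
    for s, t in full_outer_join(source, target, "path")
}
```

The decision depends only on the key column: a payload column may legitimately be NULL on a side that is present, so it cannot be used as a presence marker.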

But `extract_selections` checked the columns at positions `[0, num_keys)` instead
of the actual key columns:

```rust
let in_left  = Self::not_all_null(combined_batch, 0, num_keys)?;
let in_right = Self::not_all_null(combined_batch, right_offset, num_keys)?;
```

This assumes the key columns are physically first. They are not: a partial-schema
source preserves the user's column order, so here column 0 is `vector`. On the
target side that column is all-null (the original rows were inserted with
`vector = None`), so `in_right` was `false` for **every matched row** →
`in_both` empty → 0 updates, silently.
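
The misclassification can be reproduced with a small pure-Python model of one joined row. This is a sketch under assumptions: `not_all_null_range` is a hypothetical stand-in for the old `(offset, len)` form of `not_all_null`, assumed (from its name) to report whether any checked column is non-null, and the row values are made up for illustration.

```python
COLUMNS = ["vector", "path", "status", "file_size"]
RIGHT_OFFSET = len(COLUMNS)  # the target half starts after the source half
NUM_KEYS = 1                 # a single join key: "path"


def not_all_null_range(row, offset, length):
    """Old behaviour (modelled): inspect a contiguous range of columns."""
    return any(value is not None for value in row[offset:offset + length])


# One matched row: source columns first, then the target columns.
# "path" is present on both sides, but "vector" (column 0) is NULL on both.
matched_row = [None, "a.txt", "done", 10,
               None, "a.txt", "pending", 7]

# The positional check looks at column 0 and column 4 (both "vector"),
# never at the actual key columns 1 and 5.
in_left = not_all_null_range(matched_row, 0, NUM_KEYS)
in_right = not_all_null_range(matched_row, RIGHT_OFFSET, NUM_KEYS)
in_both = in_left and in_right
```

Here `in_both` is `False` even though the keys match on both sides, which is the silent "0 updates" outcome described above.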

The existing full-schema indexed test only passed by luck: its column 0 happened
to be non-null on both sides.

## Fix

Locate the join-key columns by name and test those (the target half carries the
same columns in the same order, offset by `right_offset`):

```rust
let source_key_cols = self.params.on.iter()
    .map(|key| combined_batch.schema().index_of(key))...;
let target_key_cols = source_key_cols.iter().map(|c| c + right_offset)...;
let in_left  = Self::not_all_null(combined_batch, &source_key_cols)?;
let in_right = Self::not_all_null(combined_batch, &target_key_cols)?;
```

`not_all_null` now takes an explicit column-index slice instead of a contiguous
`(offset, len)` range.
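
As a rough illustration of the by-name lookup, the following pure-Python sketch resolves key positions from column names and then checks only those positions. It models the idea rather than the Rust implementation: `key_column_indices` and this `not_all_null` are hypothetical helpers, and the rows are invented sample data.

```python
def key_column_indices(column_names, on, right_offset):
    """Resolve join-key positions by name for the source and target halves."""
    # list.index returns the first match, i.e. the position in the source half.
    source_key_cols = [column_names.index(key) for key in on]
    # The target half repeats the same columns, shifted by right_offset.
    target_key_cols = [c + right_offset for c in source_key_cols]
    return source_key_cols, target_key_cols


def not_all_null(row, cols):
    """True if at least one of the listed columns is non-null in this row."""
    return any(row[c] is not None for c in cols)


half = ["vector", "path", "status", "file_size"]
combined_names = half + half
source_key_cols, target_key_cols = key_column_indices(
    combined_names, on=["path"], right_offset=len(half)
)

matched_row = [None, "a.txt", "done", 10, None, "a.txt", "pending", 7]
source_only_row = [None, "new.txt", "done", 3, None, None, None, None]

matched_in_both = (not_all_null(matched_row, source_key_cols)
                   and not_all_null(matched_row, target_key_cols))
source_only_in_right = not_all_null(source_only_row, target_key_cols)
```

Because the check follows the key wherever it sits in the schema, an all-null leading payload column no longer affects the result.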

## Tests

Added `test_repro_3515_partial_schema_fully_indexed`, parameterized over storage
versions V2_0 / V2_1 / V2_2, mirroring the issue (all-null leading vector column,
scalar index covering every fragment, partial-schema update). It fails on `main`
(0 updates) and passes with the fix.

All 143 tests in the `merge_insert` module pass; `cargo fmt --all --check` and
`cargo clippy -p lance` are clean.

github-actions Bot added the bug label ("Something isn't working") Jun 12, 2026

codecov Bot commented Jun 12, 2026

Codecov Report

❌ Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/write/merge_insert.rs | 70.83% | 2 Missing and 5 partials ⚠️ |

wjones127 added the critical-fix label ("Bugs that cause crashes, security vulnerabilities, or incorrect data.") Jun 16, 2026

wjones127 merged commit 1f98d89 into lance-format:main Jun 16, 2026 (31 checks passed)

Ar-maan05 deleted the fix/merge-insert-partial-schema-3515 branch June 16, 2026 20:03
Labels

- bug: Something isn't working
- critical-fix: Bugs that cause crashes, security vulnerabilities, or incorrect data.
