
feat: persist distance_metric in dataset schema metadata #83

Merged
dcfocus merged 1 commit into lance-format:main from dcfocus:worktree-issue-80-persist-metric
Jun 12, 2026
@dcfocus dcfocus commented Jun 12, 2026

Collaborator

Summary

Persists the configurable vector-search distance_metric (added in #74/#77) in
the dataset so it round-trips on open instead of being re-specified every
time. Closes #80.

Before this change, distance_metric was a runtime-only option: a caller had to
pass the same metric on every open. If they forgot, the store silently fell
back to the default l2 and ranked results differently from how the dataset was
intended to be queried. embedding_dim was already recovered from the schema,
but the metric was not.
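
To make the ranking difference concrete, here is a minimal, self-contained sketch showing that l2 and cosine can disagree about the nearest neighbor for the same query. The function names (`l2_sq`, `cosine_distance`, `nearest`) are illustrative and are not taken from the lance-context code. The distance formulas are the textbook definitions, which may differ in detail from what the store computes.

```rust
/// Squared Euclidean (L2) distance; smaller means closer.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Cosine distance (1 - cosine similarity); smaller means closer.
/// Assumes neither vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Index of the candidate closest to `query` under the given distance.
/// Assumes `candidates` is non-empty.
fn nearest(
    query: &[f32],
    candidates: &[Vec<f32>],
    dist: fn(&[f32], &[f32]) -> f32,
) -> usize {
    let mut best = 0;
    for i in 1..candidates.len() {
        if dist(query, &candidates[i]) < dist(query, &candidates[best]) {
            best = i;
        }
    }
    best
}
```

With the query `[1.0, 0.0]` and the candidates `[10.0, 0.0]` and `[1.0, 1.0]`, cosine prefers the first candidate (same direction, larger magnitude) while l2 prefers the second (closer in space). A silent fallback to l2 on a dataset meant for cosine can therefore reorder results.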

What changed

  • Persist on create: the metric is written into the Lance schema metadata
    under lance-context:distance_metric (the same mechanism already used for
    lance-encoding:blob).
  • Recover on open: distance_metric_from_schema reads it back and uses it
    as the store's metric, mirroring embedding_dim_from_schema.
  • Mismatch guard: an explicitly passed metric that disagrees with the
    persisted one errors, reusing the existing embedding_dim mismatch-validation
    pattern in open_with_options.
  • Backward compatible: datasets created before this change carry no key and
    default to l2.
  • ContextStoreOptions.distance_metric is now Option<DistanceMetric> (None
    = use the persisted/default metric), matching embedding_dim: Option<i32>.
    Threaded through unified / server / PyO3 accordingly.
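
The persist / recover / guard flow described above can be sketched as follows. This is a hedged illustration, not the code in this PR: the schema metadata is modeled as a plain `HashMap<String, String>`, and the `DistanceMetric` variants and the helper names (`write_metric`, `persisted_metric`, `resolve_metric`) are hypothetical. Only the key `lance-context:distance_metric` and the l2 default come from the description. How an unknown stored value is handled, and what happens when an explicit metric is passed for a dataset that has no key, are assumptions.

```rust
use std::collections::HashMap;

/// Metadata key named in the PR description.
const METRIC_KEY: &str = "lance-context:distance_metric";

/// Hypothetical stand-in for the project's metric enum; variants are assumed.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DistanceMetric {
    L2,
    Cosine,
}

impl DistanceMetric {
    fn as_str(self) -> &'static str {
        match self {
            DistanceMetric::L2 => "l2",
            DistanceMetric::Cosine => "cosine",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "l2" => Some(DistanceMetric::L2),
            "cosine" => Some(DistanceMetric::Cosine),
            _ => None,
        }
    }
}

/// Persist on create: store the metric under the metadata key.
fn write_metric(metadata: &mut HashMap<String, String>, metric: DistanceMetric) {
    metadata.insert(METRIC_KEY.to_string(), metric.as_str().to_string());
}

/// Recover on open: `Ok(None)` means the key is absent (pre-change dataset).
/// Assumption: an unrecognized stored value is reported as an error.
fn persisted_metric(
    metadata: &HashMap<String, String>,
) -> Result<Option<DistanceMetric>, String> {
    match metadata.get(METRIC_KEY) {
        None => Ok(None),
        Some(raw) => DistanceMetric::parse(raw)
            .map(Some)
            .ok_or_else(|| format!("unknown distance metric in metadata: {}", raw)),
    }
}

/// Mismatch guard: `requested = None` defers to the persisted metric,
/// and an explicit value that conflicts with a persisted one is an error.
fn resolve_metric(
    metadata: &HashMap<String, String>,
    requested: Option<DistanceMetric>,
) -> Result<DistanceMetric, String> {
    match (persisted_metric(metadata)?, requested) {
        (Some(p), Some(r)) if p != r => Err(format!(
            "distance_metric mismatch: dataset has {}, caller passed {}",
            p.as_str(),
            r.as_str()
        )),
        (Some(p), _) => Ok(p),
        // Assumption: with no key, honor an explicit value, else default to l2.
        (None, r) => Ok(r.unwrap_or(DistanceMetric::L2)),
    }
}
```

In this sketch, a dataset created with cosine resolves to cosine when reopened with `None`, reopening it with an explicit l2 returns an error, and a metadata map without the key resolves to l2.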

Tests

  • Rust (store.rs): distance_metric_persists_across_reopen (create
    cosine, reopen without the option → cosine ranking), a mismatch-errors
    test, and a unit test that a metadata-less schema defaults to l2.
  • Python (test_distance_metric.py): reopen-without-option keeps cosine
    ranking; reopening with a conflicting metric raises.

Verification

  • cargo test -p lance-context-core — 41 passed
  • cargo clippy --all-targets — clean; cargo fmt --check — clean
  • pytest python/tests/ — 109 passed (the 2 S3 failures are pre-existing and
    environmental: no live bucket), ruff check / ruff format --check clean

Relationship

Follow-up to #74 / #77. Does not change the metric set or ranking math.

🤖 Generated with Claude Code

The configurable vector-search distance metric (lance-format#74/lance-format#77) was a runtime-only
option: callers had to re-pass `distance_metric` on every open, and if they
forgot, the store silently fell back to `l2` and ranked results differently
from how the dataset was intended to be queried.

Persist the metric in the Lance schema metadata (key
`lance-context:distance_metric`) on create, the same mechanism already used
for blob encoding. On open it is recovered automatically, mirroring how
`embedding_dim` round-trips via the schema. An explicitly passed metric that
disagrees with the persisted one now errors, reusing the `embedding_dim`
mismatch-validation pattern. Datasets created before this change carry no key
and default to `l2`, preserving existing behavior.

`ContextStoreOptions.distance_metric` becomes `Option<DistanceMetric>` (None =
use the persisted/default metric), matching `embedding_dim: Option<i32>`.

Closes lance-format#80

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dcfocus dcfocus merged commit b313c00 into lance-format:main Jun 12, 2026
9 checks passed
@dcfocus dcfocus mentioned this pull request Jun 12, 2026
25 tasks

Development

Successfully merging this pull request may close these issues.

Persist configured distance_metric in dataset metadata (not just runtime kwarg)
