feat: persist distance_metric in dataset schema metadata #83
Merged
dcfocus merged 1 commit on Jun 12, 2026
The configurable vector-search distance metric (lance-format#74/lance-format#77) was a runtime-only option: callers had to re-pass `distance_metric` on every open, and if they forgot, the store silently fell back to `l2` and ranked results differently from how the dataset was intended to be queried.

Persist the metric in the Lance schema metadata (key `lance-context:distance_metric`) on create, the same mechanism already used for blob encoding. On open it is recovered automatically, mirroring how `embedding_dim` round-trips via the schema. An explicitly passed metric that disagrees with the persisted one now errors, reusing the `embedding_dim` mismatch-validation pattern.

Datasets created before this change carry no key and default to `l2`, preserving existing behavior. `ContextStoreOptions.distance_metric` becomes `Option<DistanceMetric>` (None = use the persisted/default metric), matching `embedding_dim: Option<i32>`.

Closes lance-format#80

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Persists the configurable vector-search `distance_metric` (added in #74/#77) in the dataset so it round-trips on `open` instead of being re-specified every time. Closes #80.
Before this change, `distance_metric` was a runtime-only option: a caller had to pass the same metric on every `open`, and if they forgot, the store silently fell back to the default `l2` and ranked results differently from how the dataset was intended to be queried. `embedding_dim` already recovers from the schema, but the metric did not.
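To make the "ranked results differently" point concrete, here is a minimal sketch using the textbook definitions of l2 and cosine distance. It is not the store's ranking code; `l2_distance`, `cosine_distance`, and `nearest` are hypothetical helpers written for this illustration.

```rust
/// Euclidean (l2) distance between two vectors of equal length.
fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Cosine distance: 1 - cos(angle between a and b).
/// Assumes neither vector is all zeros.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

/// Index of the candidate closest to `query` under `dist`.
/// Assumes `candidates` is non-empty.
fn nearest(
    query: &[f32],
    candidates: &[Vec<f32>],
    dist: fn(&[f32], &[f32]) -> f32,
) -> usize {
    let mut best = 0;
    for i in 1..candidates.len() {
        if dist(query, &candidates[i]) < dist(query, &candidates[best]) {
            best = i;
        }
    }
    best
}
```

With the query `[1.0, 0.0]` and candidates `[10.0, 0.0]` and `[1.0, 1.0]`, l2 picks the second candidate (distance 1 versus 9), while cosine picks the first (it points in exactly the same direction as the query). A dataset built for cosine but opened with the `l2` fallback would therefore return a different top hit.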
What changed
- On create, the metric is written to the Lance schema metadata under `lance-context:distance_metric` (the same mechanism already used for `lance-encoding:blob`).
- On open, `distance_metric_from_schema` reads it back and uses it as the store's metric, mirroring `embedding_dim_from_schema`.
- An explicitly passed metric that disagrees with the persisted one errors, reusing the existing `embedding_dim` mismatch-validation pattern in `open_with_options`.
- Datasets created before this change carry no key and default to `l2`.
- `ContextStoreOptions.distance_metric` is now `Option<DistanceMetric>` (`None` = use the persisted/default metric), matching `embedding_dim: Option<i32>`. Threaded through unified / server / PyO3 accordingly.
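The create/open flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: a plain `HashMap<String, String>` stands in for the Lance schema metadata, the `DistanceMetric` enum here lists only the two metrics named in this description, and `write_metric`, `persisted_metric`, and `resolve_metric` are hypothetical names. How the real `open_with_options` treats an explicit metric on a dataset that has no key is not stated here; the sketch assumes the explicit value is honored.

```rust
use std::collections::HashMap;

/// Metadata key named in this PR.
const DISTANCE_METRIC_KEY: &str = "lance-context:distance_metric";

/// Stand-in for the crate's metric enum; the real variant set may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DistanceMetric {
    L2,
    Cosine,
}

impl DistanceMetric {
    fn as_str(self) -> &'static str {
        match self {
            DistanceMetric::L2 => "l2",
            DistanceMetric::Cosine => "cosine",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "l2" => Some(DistanceMetric::L2),
            "cosine" => Some(DistanceMetric::Cosine),
            _ => None,
        }
    }
}

/// On create: record the metric in the (stand-in) schema metadata.
fn write_metric(metadata: &mut HashMap<String, String>, metric: DistanceMetric) {
    metadata.insert(DISTANCE_METRIC_KEY.to_string(), metric.as_str().to_string());
}

/// Read the persisted metric, if any. `Ok(None)` means the key is absent,
/// as for datasets created before this change.
fn persisted_metric(
    metadata: &HashMap<String, String>,
) -> Result<Option<DistanceMetric>, String> {
    match metadata.get(DISTANCE_METRIC_KEY) {
        None => Ok(None),
        Some(raw) => DistanceMetric::parse(raw)
            .map(Some)
            .ok_or_else(|| format!("unknown distance metric in schema metadata: {}", raw)),
    }
}

/// On open: `None` means "use the persisted/default metric"; an explicit
/// value must agree with whatever was persisted.
fn resolve_metric(
    metadata: &HashMap<String, String>,
    requested: Option<DistanceMetric>,
) -> Result<DistanceMetric, String> {
    match (persisted_metric(metadata)?, requested) {
        (Some(p), Some(r)) if p != r => Err(format!(
            "distance_metric mismatch: requested {}, dataset was created with {}",
            r.as_str(),
            p.as_str()
        )),
        (Some(p), _) => Ok(p),
        // Assumption: with no persisted key, an explicit request is honored.
        (None, Some(r)) => Ok(r),
        (None, None) => Ok(DistanceMetric::L2),
    }
}
```

Making the option an `Option` is what lets the open path distinguish "caller did not say" from "caller asked for `l2`"; with a plain enum defaulting to `l2`, a reopen without the option would be indistinguishable from a conflicting request.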
Tests
- `store.rs`: `distance_metric_persists_across_reopen` (create cosine, reopen without the option → cosine ranking), a mismatch-errors test, and a unit test that a metadata-less schema defaults to `l2`.
- `test_distance_metric.py`: reopen-without-option keeps cosine ranking; reopening with a conflicting metric raises.
Verification
- `cargo test -p lance-context-core`: 41 passed
- `cargo clippy --all-targets`: clean; `cargo fmt --check`: clean
- `pytest python/tests/`: 109 passed (the 2 S3 failures are pre-existing and environmental: no live bucket), `ruff check` / `ruff format --check` clean
Relationship
Follow-up to #74 / #77. Does not change the metric set or ranking math.
🤖 Generated with Claude Code