docs: add lance skills as user guide #5877
Merged
7 commits
95cf8b6
docs: add lance skills as user guide
Xuanwo cd00ef6
docs(skills): use target_partition_size for vector index
Xuanwo fb7da11
docs(skills): clarify installation and compatibility
Xuanwo 7edcc66
Revert "docs(skills): clarify installation and compatibility"
Xuanwo b3ee727
Address comments
Xuanwo fb079ec
Merge branch 'main' into luban/chat-aunt
Xuanwo 494065f
Update skills/lance-user-guide/references/index-selection.md
Xuanwo
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| # Skills | ||
|
|
||
| This directory contains code agent skills for the Lance project. | ||
|
|
||
| Each skill is a folder that contains a required `SKILL.md` (with YAML frontmatter) and optional `scripts/`, `references/`, and `assets/`. | ||
|
|
||
| ## Install | ||
|
|
||
| ```bash | ||
| npx skills add lance-format/lance | ||
| ``` | ||
|
|
||
| Restart code agents after installing. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,227 @@ | ||
| --- | ||
| name: lance-user-guide | ||
| description: Guide Code Agents to help Lance users write/read datasets and build/choose indices. Use when a user asks how to use Lance (Python/Rust/CLI), how to write_dataset/open/scan, how to build vector indexes (IVF_PQ, IVF_HNSW_*), how to build scalar indexes (BTREE, BITMAP, LABEL_LIST, NGRAM, INVERTED, BLOOMFILTER, RTREE, etc.), how to combine filters with vector search, or how to debug indexing and scan performance. | ||
| --- | ||
|
|
||
| # Lance User Guide | ||
|
|
||
| ## Scope | ||
|
|
||
| Use this skill to answer questions about: | ||
|
|
||
| - Writing datasets (create/append/overwrite) and reading/scanning datasets | ||
| - Vector search (nearest-neighbor queries) and vector index creation/tuning | ||
| - Scalar index creation and choosing a scalar index type for a filter workload | ||
| - Combining filters (metadata predicates) with vector search | ||
|
|
||
| Do not use this skill for: | ||
|
|
||
| - Contributing to Lance itself (repo development, internal architecture) | ||
| - File format internals beyond what is required to use the API correctly | ||
|
|
||
| ## Installation (quick) | ||
|
|
||
| Python: | ||
|
|
||
| ```bash | ||
| pip install pylance | ||
| ``` | ||
|
|
||
| Verify: | ||
|
|
||
| ```bash | ||
| python -c "import lance; print(lance.__version__)" | ||
| ``` | ||
|
|
||
| Rust: | ||
|
|
||
| ```bash | ||
| cargo add lance | ||
| ``` | ||
|
|
||
| Or add it to `Cargo.toml` (choose an appropriate version for your project): | ||
|
|
||
| ```toml | ||
| [dependencies] | ||
| lance = "x.y" | ||
| ``` | ||
|
|
||
| From source (this repository): | ||
|
|
||
| ```bash | ||
| maturin develop -m python/Cargo.toml | ||
| ``` | ||
|
|
||
| ## Minimal intake (ask only what you need) | ||
|
|
||
| Collect the minimum information required to avoid wrong guidance: | ||
|
|
||
| - Language/API surface: Python / Rust / CLI | ||
| - Storage: local filesystem / S3 / other object store | ||
| - Workload: scan-only / filter-heavy / vector search / hybrid (vector + filter) | ||
| - Vector details (if applicable): dimension, metric (L2/cosine/dot), latency target, recall target | ||
| - Update pattern: mostly append / frequent overwrite / frequent deletes/updates | ||
| - Data scale: approximate row count and whether there are many small files | ||
|
|
||
| If the user does not specify a language, default to Python examples and provide a short mapping to Rust concepts. | ||
|
|
||
| ## Workflow decision tree | ||
|
|
||
| 1. If the question is "How do I write or update data?": use the **Write** playbook. | ||
| 2. If the question is "How do I read / scan / filter?": use the **Read** playbook. | ||
| 3. If the question is "How do I do kNN / vector search?": use the **Vector search** playbook. | ||
| 4. If the question is "Which index should I use?": consult `references/index-selection.md` and confirm constraints. | ||
| 5. If the question is "Why is this slow / why are results missing?": use **Troubleshooting** and ask for a minimal reproduction. | ||
|
|
||
| ## Primary playbooks (Python) | ||
|
|
||
| ### Write | ||
|
|
||
| Prefer `lance.write_dataset` for most user workflows. | ||
|
|
||
| ```python | ||
| import lance | ||
| import pyarrow as pa | ||
|
|
||
| vectors = pa.array( | ||
| [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], | ||
| type=pa.list_(pa.float32(), 3), | ||
| ) | ||
| table = pa.table({"id": [1, 2], "vector": vectors, "category": ["a", "b"]}) | ||
|
|
||
| ds = lance.write_dataset(table, "my-data.lance", mode="create") | ||
| ds = lance.write_dataset(table, "my-data.lance", mode="append") | ||
| ds = lance.write_dataset(table, "my-data.lance", mode="overwrite") | ||
| ``` | ||
|
|
||
| Validation checklist: | ||
|
|
||
| - Re-open and count rows: `lance.dataset(uri).count_rows()` | ||
| - Confirm schema: `lance.dataset(uri).schema` | ||
|
|
||
| Notes: | ||
|
|
||
| - Use `storage_options={...}` when writing to an object store URI. | ||
| - If the user mentions non-atomic object stores, mention `commit_lock` and point them to the user guide. | ||
|
|
||
| ### Read | ||
|
|
||
| Use `lance.dataset` + `scanner(...)` for pushdowns (projection, filter, limit, nearest). | ||
|
|
||
| ```python | ||
| import lance | ||
|
|
||
| ds = lance.dataset("my-data.lance") | ||
| tbl = ds.scanner( | ||
| columns=["id", "category"], | ||
| filter="category = 'a' and id >= 10", | ||
| limit=100, | ||
| ).to_table() | ||
| ``` | ||
|
|
||
| Validation checklist: | ||
|
|
||
| - If performance is the concern, ask for a minimal `scanner(...)` call that reproduces it. | ||
| - If correctness is the concern, ask for the exact `filter` string and whether `prefilter` is enabled (when using `nearest`). | ||
|
|
||
| ### Vector search (nearest) | ||
|
|
||
| Run vector search with `scanner(nearest=...)` or `to_table(nearest=...)`. | ||
|
|
||
| ```python | ||
| import lance | ||
| import numpy as np | ||
|
|
||
| ds = lance.dataset("my-data.lance") | ||
| q = np.array([1.0, 2.0, 3.0], dtype=np.float32) | ||
| tbl = ds.to_table(nearest={"column": "vector", "q": q, "k": 10}) | ||
| ``` | ||
|
|
||
| If combining a filter with vector search, decide whether the filter must run before the vector query: | ||
|
|
||
| - Use `prefilter=True` when the filter is highly selective and correctness (top-k among filtered rows) matters. | ||
| - Use `prefilter=False` when the filter is not very selective and speed matters, and accept that results can be fewer than `k`. | ||
|
|
||
| ```python | ||
| tbl = ds.scanner( | ||
| nearest={"column": "vector", "q": q, "k": 10}, | ||
| filter="category = 'a'", | ||
| prefilter=True, | ||
| ).to_table() | ||
| ``` | ||
|
|
||
| ### Build a vector index | ||
|
|
||
| Create a vector index with `LanceDataset.create_index(...)`. | ||
|
|
||
| Start with a minimal working configuration: | ||
|
|
||
| ```python | ||
| ds = lance.dataset("my-data.lance") | ||
| ds = ds.create_index( | ||
| "vector", | ||
| index_type="IVF_PQ", | ||
| target_partition_size=8192, | ||
| num_sub_vectors=16, | ||
| ) | ||
| ``` | ||
|
|
||
| Then verify: | ||
|
|
||
| - `ds.describe_indices()` (preferred) or `ds.list_indices()` (can be expensive) | ||
| - A small `nearest` query that uses the index | ||
|
|
||
| For parameter selection and tuning, consult `references/index-selection.md`. | ||
|
|
||
| ### Build a scalar index | ||
|
|
||
| Scalar indices speed up scans with filters. Use `create_scalar_index` for a stable entry point. | ||
|
|
||
| ```python | ||
| ds = lance.dataset("my-data.lance") | ||
| ds.create_scalar_index("category", "BTREE", replace=True) | ||
| ``` | ||
|
|
||
| Then verify: | ||
|
|
||
| - `ds.describe_indices()` | ||
| - A representative `scanner(filter=...)` query | ||
|
|
||
| To choose a scalar index type (BTREE vs BITMAP vs LABEL_LIST vs NGRAM vs INVERTED, etc.), consult `references/index-selection.md`. | ||
|
|
||
| ## Troubleshooting patterns | ||
|
|
||
| ### "Vector search + filter returns fewer than k rows" | ||
|
|
||
| - Explain the difference between post-filtering and pre-filtering. | ||
| - Suggest `prefilter=True` if the user expects top-k among filtered rows. | ||
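
The difference between the two modes can be illustrated without Lance at all. The sketch below is a brute-force toy in plain Python, not how Lance executes queries; the `ROWS` data, the `top_k` helper, and both search functions are hypothetical names made up for illustration. It shows why post-filtering can return fewer than `k` rows while pre-filtering returns the top-k among the rows that match the filter.

```python
import math

# Hypothetical toy rows: (id, vector, category).
ROWS = [
    (1, (0.0, 0.0), "a"),
    (2, (0.1, 0.0), "b"),
    (3, (0.2, 0.0), "b"),
    (4, (5.0, 5.0), "a"),
    (5, (6.0, 6.0), "a"),
]


def top_k(rows, query, k):
    """Return the k rows closest to `query` by Euclidean (L2) distance."""
    return sorted(rows, key=lambda row: math.dist(row[1], query))[:k]


def search_postfilter(rows, query, k, category):
    """Take the k nearest rows first, then drop rows that fail the filter."""
    return [row for row in top_k(rows, query, k) if row[2] == category]


def search_prefilter(rows, query, k, category):
    """Apply the filter first, then take the k nearest among the survivors."""
    return top_k([row for row in rows if row[2] == category], query, k)


query = (0.0, 0.0)
post_ids = [row[0] for row in search_postfilter(ROWS, query, 3, "a")]
pre_ids = [row[0] for row in search_prefilter(ROWS, query, 3, "a")]
print(post_ids)  # only the matching rows that were already in the global top 3
print(pre_ids)   # the 3 nearest rows among those with category "a"
```

In this toy, two of the three globally nearest rows belong to category `b`, so the post-filtered result shrinks to a single row, while the pre-filtered result still contains three.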
|
|
||
| ### "Index creation is slow" | ||
|
|
||
| - Confirm vector dimension and `num_sub_vectors`. | ||
| - For IVF_PQ, call out the common pitfall: avoid misaligned `dimension / num_sub_vectors` (see `references/index-selection.md`). | ||
|
|
||
| ### "Scan is slow even with a scalar index" | ||
|
|
||
| - Ask whether the filter is compatible with the index (equality vs range vs text search). | ||
| - Suggest checking whether scalar index usage is disabled (`use_scalar_index=False`). | ||
|
|
||
| ## Local verification (when a repo checkout is available) | ||
|
|
||
| When answering API questions, confirm the exact signature and docstrings locally: | ||
|
|
||
| - Python I/O entry points: `python/python/lance/dataset.py` (`write_dataset`, `LanceDataset.scanner`) | ||
| - Vector indexing: `python/python/lance/dataset.py` (`create_index`) | ||
| - Scalar indexing: `python/python/lance/dataset.py` (`create_scalar_index`) | ||
|
|
||
| Use targeted search: | ||
|
|
||
| ```bash | ||
| rg -n "def write_dataset\\b|def create_index\\b|def create_scalar_index\\b|def scanner\\b" python/python/lance/dataset.py | ||
| ``` | ||
|
|
||
| ## Bundled resources | ||
|
|
||
| - Index selection and tuning: `references/index-selection.md` | ||
| - I/O and versioning cheat sheet: `references/io-cheatsheet.md` | ||
| - Runnable minimal example: `scripts/python_end_to_end.py` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| ## Index selection (quick) | ||
|
|
||
| Use this file when the user asks "which index should I use" or "how do I tune it". | ||
|
|
||
| Always confirm: | ||
|
|
||
| - The query pattern (filter-only, vector-only, hybrid) | ||
| - Data scale (rows, vector dimension) | ||
| - Update pattern (append vs frequent updates/deletes) | ||
| - Correctness needs (must return top-k within a filtered subset vs best-effort) | ||
|
|
||
| ## Decision table | ||
|
|
||
| | Workload | Recommended starting point | Notes | | ||
| | --- | --- | --- | | ||
| | Filter-only scans (`scanner(filter=...)`) | Create a scalar index on the filtered column | Choose scalar index type based on predicate shape and cardinality | | ||
| | Vector search only (`nearest=...`) on large data | Build a vector index | Start with `IVF_PQ` if you need compression; tune `nprobes` / `refine_factor` | | ||
| | Vector search + selective filter | Scalar index for filter + vector index for search | Use `prefilter=True` when you need true top-k among filtered rows | | ||
| | Vector search + non-selective filter | Vector index only (or scalar index optional) | Consider `prefilter=False` for speed; accept fewer than k results | | ||
| | Text search | Create an `INVERTED` scalar index | Use `full_text_query=...` when available; note that `FTS` is not a universal alias in all SDK versions | | ||
|
|
||
| ## Vector index types (user-facing summary) | ||
|
|
||
| Vector index names typically follow a pattern like `{clustering}_{sub_index}_{quantization}`. | ||
|
|
||
| Common combinations: | ||
|
|
||
| - `IVF_PQ`: IVF clustering + PQ compression | ||
| - `IVF_HNSW_SQ`: IVF clustering + HNSW + SQ | ||
| - `IVF_SQ`: IVF clustering + SQ | ||
| - `IVF_RQ`: IVF clustering + RQ | ||
| - `IVF_FLAT`: IVF clustering + no quantization (exact vectors within clusters) | ||
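
As a reading aid for the naming pattern, here is a minimal sketch that splits an index type name into its parts. It is a purely string-level interpretation of the bullets above; `parse_vector_index_type` is a hypothetical helper, not a Lance API, and it does not check whether a given combination is supported by any Lance build.

```python
def parse_vector_index_type(name):
    """Split a name such as "IVF_HNSW_SQ" into clustering, sub-index, and quantization.

    Two-part names are read as {clustering}_{quantization}, with "FLAT"
    treated as "no quantization", following the list above.
    """
    parts = name.upper().split("_")
    if len(parts) == 2:
        clustering, last = parts
        sub_index = None
        quantization = None if last == "FLAT" else last
    elif len(parts) == 3:
        clustering, sub_index, quantization = parts
    else:
        raise ValueError(f"unexpected vector index name: {name!r}")
    return {
        "clustering": clustering,
        "sub_index": sub_index,
        "quantization": quantization,
    }


for index_name in ("IVF_PQ", "IVF_HNSW_SQ", "IVF_FLAT"):
    print(index_name, parse_vector_index_type(index_name))
```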
|
|
||
| If you are unsure which types are supported in the user's environment, recommend starting with `IVF_PQ` and fall back to "try and see" (the API will error on unsupported types). | ||
|
|
||
| ## Vector index creation defaults | ||
|
|
||
| Start with: | ||
|
|
||
| - `index_type="IVF_PQ"` | ||
| - `target_partition_size`: start with 8192 and adjust based on the dataset size and latency/recall needs | ||
| - `num_sub_vectors`: choose a value that divides the vector dimension | ||
|
|
||
| Practical warning (performance): | ||
|
|
||
| - Avoid misalignment: `(dimension / num_sub_vectors) % 8 == 0` is a common sweet spot for faster index creation. | ||
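
To make the alignment rule concrete, the following sketch enumerates `num_sub_vectors` values that both divide the dimension and leave a sub-vector length that is a multiple of 8. The helper name `aligned_num_sub_vectors` is hypothetical; it only applies the arithmetic stated above and says nothing about which value gives the best recall.

```python
def aligned_num_sub_vectors(dimension):
    """List num_sub_vectors values m where dimension % m == 0
    and (dimension // m) % 8 == 0."""
    candidates = []
    for m in range(1, dimension + 1):
        if dimension % m != 0:
            continue  # m must divide the vector dimension
        if (dimension // m) % 8 == 0:
            candidates.append(m)
    return candidates


print(aligned_num_sub_vectors(128))
print(aligned_num_sub_vectors(768))
```

A dimension with no multiple-of-8 factor (for example the 3-dimensional toy vectors in the Write playbook) yields an empty list, which is a hint that the alignment rule cannot be satisfied for that data.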
|
|
||
| ## Vector search tuning defaults | ||
|
|
||
| Tune recall vs latency with: | ||
|
|
||
| - `nprobes`: how many IVF partitions to search | ||
| - `refine_factor`: how many candidates to re-rank to improve accuracy | ||
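
The effect of `nprobes` can be seen in a toy IVF search written in plain Python. This is a conceptual sketch only: the data, centroids, and function names are hypothetical, the centroids are hand-picked rather than trained, and it does not model quantization or `refine_factor`. It is not Lance's implementation.

```python
import math


def nearest_centroids(centroids, vector, n):
    """Return the indices of the n centroids closest to `vector`."""
    order = sorted(
        range(len(centroids)),
        key=lambda i: math.dist(centroids[i], vector),
    )
    return order[:n]


def build_partitions(vectors, centroids):
    """Assign each vector index to the partition of its closest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        partitions[nearest_centroids(centroids, vec, 1)[0]].append(idx)
    return partitions


def ivf_search(vectors, centroids, partitions, query, k, nprobes):
    """Search only the `nprobes` partitions whose centroids are closest to the query."""
    candidates = []
    for part in nearest_centroids(centroids, query, nprobes):
        candidates.extend(partitions[part])
    candidates.sort(key=lambda idx: math.dist(vectors[idx], query))
    return candidates[:k]


# Hypothetical toy data: five 2-D vectors and three hand-picked centroids.
VECTORS = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (10.0, 0.0)]
CENTROIDS = [(0.5, 0.0), (5.0, 0.0), (10.0, 0.0)]
PARTITIONS = build_partitions(VECTORS, CENTROIDS)
QUERY = (2.6, 0.0)

one_probe = ivf_search(VECTORS, CENTROIDS, PARTITIONS, QUERY, k=2, nprobes=1)
two_probes = ivf_search(VECTORS, CENTROIDS, PARTITIONS, QUERY, k=2, nprobes=2)
print(one_probe)   # misses the true nearest vector, which lives in another partition
print(two_probes)  # matches a brute-force search over all vectors
```

Probing more partitions scans more candidates, so recall improves at the cost of latency, which is the trade-off the `nprobes` setting exposes.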
|
|
||
| When a user reports "too slow" or "bad recall", ask for: | ||
|
|
||
| - Current `nprobes`, `refine_factor`, and index type | ||
| - Whether the query is using `prefilter` | ||
|
|
||
| ## Scalar index selection (starting guidance) | ||
|
|
||
| Choose scalar index type based on the filter expression: | ||
|
|
||
| - Equality filters on high-cardinality columns: start with `BTREE` | ||
| - Equality / IN-list filters on low-cardinality columns: start with `BITMAP` | ||
| - List membership filters on list-like columns: start with `LABEL_LIST` | ||
| - Substring / `contains(...)` filters on strings: start with `NGRAM` | ||
| - Full-text search (FTS): start with `INVERTED` | ||
| - Range filters: start with range-friendly options (for example `ZONEMAP` when appropriate) | ||
| - Highly selective negative membership / presence checks: consider `BLOOMFILTER` (inexact) | ||
| - Geospatial queries (if present in your build): use `RTREE` | ||
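
The starting guidance above can be written down as a small lookup. The sketch below is a hypothetical helper, not part of Lance: the filter-shape labels are invented for illustration, the index names are taken from the list above, and the `BTREE` fallback mirrors the "safe baseline" advice in this file.

```python
# Filter-shape labels (keys) are hypothetical; index type names come from the guidance above.
SCALAR_INDEX_BY_FILTER = {
    "equality_high_cardinality": "BTREE",
    "equality_low_cardinality": "BITMAP",
    "list_membership": "LABEL_LIST",
    "substring": "NGRAM",
    "full_text": "INVERTED",
    "range": "ZONEMAP",
    "negative_membership": "BLOOMFILTER",
    "geospatial": "RTREE",
}


def suggest_scalar_index(filter_shape):
    """Return a starting index type for a filter shape, defaulting to BTREE."""
    return SCALAR_INDEX_BY_FILTER.get(filter_shape, "BTREE")


print(suggest_scalar_index("substring"))
print(suggest_scalar_index("something_else"))
```

The result is only a starting point; a small benchmark on representative queries should still confirm the choice.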
|
|
||
| ## JSON fields | ||
|
|
||
| Lance scalar indices are created on physical columns. If you want to index a JSON sub-field: | ||
|
|
||
| 1. Materialize the extracted value into a new column (for example with `add_columns`) | ||
| 2. Create a scalar index on that new column | ||
|
|
||
| Example (Python, using SQL expressions): | ||
|
|
||
| ```python | ||
| ds = lance.dataset(uri) | ||
| ds.add_columns({"country": "json_extract(payload, '$.country')"}) | ||
| ds.create_scalar_index("country", "BTREE", replace=True) | ||
| ``` | ||
|
|
||
| If you cannot confidently map the filter to an index type, recommend `BTREE` as a safe baseline and confirm via a small benchmark on representative queries. | ||
We are missing a few here, like label list, bloom filter, and rtree. We should also mention how to handle indexing JSON data.
PTAL