lance-format · Xuanwo · Feb 23, 2026 · Feb 3, 2026 · Feb 3, 2026 · Feb 5, 2026
diff --git a/skills/README.md b/skills/README.md
@@ -0,0 +1,13 @@
+# Skills
+
+This directory contains code agent skills for the Lance project.
+
+Each skill is a folder that contains a required `SKILL.md` (with YAML frontmatter) and optional `scripts/`, `references/`, and `assets/`.
+
+## Install
+
+```bash
+npx skills add lance-format/lance
+```
+
+Restart code agents after installing.
diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
@@ -0,0 +1,227 @@
+---
+name: lance-user-guide
+description: Guide Code Agents to help Lance users write/read datasets and build/choose indices. Use when a user asks how to use Lance (Python/Rust/CLI), how to write_dataset/open/scan, how to build vector indexes (IVF_PQ, IVF_HNSW_*), how to build scalar indexes (BTREE, BITMAP, LABEL_LIST, NGRAM, INVERTED, BLOOMFILTER, RTREE, etc.), how to combine filters with vector search, or how to debug indexing and scan performance.
+---
+
+# Lance User Guide
+
+## Scope
+
+Use this skill to answer questions about:
+
+- Writing datasets (create/append/overwrite) and reading/scanning datasets
+- Vector search (nearest-neighbor queries) and vector index creation/tuning
+- Scalar index creation and choosing a scalar index type for a filter workload
+- Combining filters (metadata predicates) with vector search
+
+Do not use this skill for:
+
+- Contributing to Lance itself (repo development, internal architecture)
+- File format internals beyond what is required to use the API correctly
+
+## Installation (quick)
+
+Python:
+
+```bash
+pip install pylance
+```
+
+Verify:
+
+```bash
+python -c "import lance; print(lance.__version__)"
+```
+
+Rust:
+
+```bash
+cargo add lance
+```
+
+Or add it to `Cargo.toml` (choose an appropriate version for your project):
+
+```toml
+[dependencies]
+lance = "x.y"
+```
+
+From source (this repository):
+
+```bash
+maturin develop -m python/Cargo.toml
+```
+
+## Minimal intake (ask only what you need)
+
+Collect the minimum information required to avoid wrong guidance:
+
+- Language/API surface: Python / Rust / CLI
+- Storage: local filesystem / S3 / other object store
+- Workload: scan-only / filter-heavy / vector search / hybrid (vector + filter)
+- Vector details (if applicable): dimension, metric (L2/cosine/dot), latency target, recall target
+- Update pattern: mostly append / frequent overwrite / frequent deletes/updates
+- Data scale: approximate row count and whether there are many small files
+
+If the user does not specify a language, default to Python examples and provide a short mapping to Rust concepts.
+
+## Workflow decision tree
+
+1. If the question is "How do I write or update data?": use the **Write** playbook.
+2. If the question is "How do I read / scan / filter?": use the **Read** playbook.
+3. If the question is "How do I do kNN / vector search?": use the **Vector search** playbook.
+4. If the question is "Which index should I use?": consult `references/index-selection.md` and confirm constraints.
+5. If the question is "Why is this slow / why are results missing?": use **Troubleshooting** and ask for a minimal reproduction.
+
+## Primary playbooks (Python)
+
+### Write
+
+Prefer `lance.write_dataset` for most user workflows.
+
+```python
+import lance
+import pyarrow as pa
+
+vectors = pa.array(
+    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
+    type=pa.list_(pa.float32(), 3),
+)
+table = pa.table({"id": [1, 2], "vector": vectors, "category": ["a", "b"]})
+
+ds = lance.write_dataset(table, "my-data.lance", mode="create")
+ds = lance.write_dataset(table, "my-data.lance", mode="append")
+ds = lance.write_dataset(table, "my-data.lance", mode="overwrite")
+```
+
+Validation checklist:
+
+- Re-open and count rows: `lance.dataset(uri).count_rows()`
+- Confirm schema: `lance.dataset(uri).schema`
+
+Notes:
+
+- Use `storage_options={...}` when writing to an object store URI.
+- If the user mentions non-atomic object stores, mention `commit_lock` and point them to the user guide.
+
+### Read
+
+Use `lance.dataset` + `scanner(...)` for pushdowns (projection, filter, limit, nearest).
+
+```python
+import lance
+
+ds = lance.dataset("my-data.lance")
+tbl = ds.scanner(
+    columns=["id", "category"],
+    filter="category = 'a' and id >= 10",
+    limit=100,
+).to_table()
+```
+
+Validation checklist:
+
+- If performance is the concern, ask for a minimal `scanner(...)` call that reproduces it.
+- If correctness is the concern, ask for the exact `filter` string and whether `prefilter` is enabled (when using `nearest`).
+
+### Vector search (nearest)
+
+Run vector search with `scanner(nearest=...)` or `to_table(nearest=...)`.
+
+```python
+import lance
+import numpy as np
+
+ds = lance.dataset("my-data.lance")
+q = np.array([1.0, 2.0, 3.0], dtype=np.float32)
+tbl = ds.to_table(nearest={"column": "vector", "q": q, "k": 10})
+```
+
+If combining a filter with vector search, decide whether the filter must run before the vector query:
+
+- Use `prefilter=True` when the filter is highly selective and correctness (top-k among filtered rows) matters.
+- Use `prefilter=False` when the filter is not very selective and speed matters, and accept that results can be fewer than `k`.
+
+```python
+tbl = ds.scanner(
+    nearest={"column": "vector", "q": q, "k": 10},
+    filter="category = 'a'",
+    prefilter=True,
+).to_table()
+```
+
+### Build a vector index
+
+Create a vector index with `LanceDataset.create_index(...)`.
+
+Start with a minimal working configuration:
+
+```python
+ds = lance.dataset("my-data.lance")
+ds = ds.create_index(
+    "vector",
+    index_type="IVF_PQ",
+    target_partition_size=8192,
+    num_sub_vectors=16,
+)
+```
+
+Then verify:
+
+- `ds.describe_indices()` (preferred) or `ds.list_indices()` (can be expensive)
+- A small `nearest` query that uses the index
+
+For parameter selection and tuning, consult `references/index-selection.md`.
+
+### Build a scalar index
+
+Scalar indices speed up scans with filters. Use `create_scalar_index` for a stable entry point.
+
+```python
+ds = lance.dataset("my-data.lance")
+ds.create_scalar_index("category", "BTREE", replace=True)
+```
+
+Then verify:
+
+- `ds.describe_indices()`
+- A representative `scanner(filter=...)` query
+
+To choose a scalar index type (BTREE vs BITMAP vs LABEL_LIST vs NGRAM vs INVERTED, etc.), consult `references/index-selection.md`.
+
+## Troubleshooting patterns
+
+### "Vector search + filter returns fewer than k rows"
+
+- Explain the difference between post-filtering and pre-filtering.
+- Suggest `prefilter=True` if the user expects top-k among filtered rows.
+
+### "Index creation is slow"
+
+- Confirm vector dimension and `num_sub_vectors`.
+- For IVF_PQ, call out the common pitfall: avoid misaligned `dimension / num_sub_vectors` (see `references/index-selection.md`).
+
+### "Scan is slow even with a scalar index"
+
+- Ask whether the filter is compatible with the index (equality vs range vs text search).
+- Suggest checking whether scalar index usage is disabled (`use_scalar_index=False`).
+
+## Local verification (when a repo checkout is available)
+
+When answering API questions, confirm the exact signature and docstrings locally:
+
+- Python I/O entry points: `python/python/lance/dataset.py` (`write_dataset`, `LanceDataset.scanner`)
+- Vector indexing: `python/python/lance/dataset.py` (`create_index`)
+- Scalar indexing: `python/python/lance/dataset.py` (`create_scalar_index`)
+
+Use targeted search:
+
+```bash
+rg -n "def write_dataset\\b|def create_index\\b|def create_scalar_index\\b|def scanner\\b" python/python/lance/dataset.py
+```
+
+## Bundled resources
+
+- Index selection and tuning: `references/index-selection.md`
+- I/O and versioning cheat sheet: `references/io-cheatsheet.md`
+- Runnable minimal example: `scripts/python_end_to_end.py`
diff --git a/skills/lance-user-guide/references/index-selection.md b/skills/lance-user-guide/references/index-selection.md
@@ -0,0 +1,88 @@
+## Index selection (quick)
+
+Use this file when the user asks "which index should I use" or "how do I tune it".
+
+Always confirm:
+
+- The query pattern (filter-only, vector-only, hybrid)
+- Data scale (rows, vector dimension)
+- Update pattern (append vs frequent updates/deletes)
+- Correctness needs (must return top-k within a filtered subset vs best-effort)
+
+## Decision table
+
+| Workload | Recommended starting point | Notes |
+| --- | --- | --- |
+| Filter-only scans (`scanner(filter=...)`) | Create a scalar index on the filtered column | Choose scalar index type based on predicate shape and cardinality |
+| Vector search only (`nearest=...`) on large data | Build a vector index | Start with `IVF_PQ` if you need compression; tune `nprobes` / `refine_factor` |
+| Vector search + selective filter | Scalar index for filter + vector index for search | Use `prefilter=True` when you need true top-k among filtered rows |
+| Vector search + non-selective filter | Vector index only (or scalar index optional) | Consider `prefilter=False` for speed; accept fewer than k results |
+| Text search | Create an `INVERTED` scalar index | Use `full_text_query=...` when available; note that `FTS` is not a universal alias in all SDK versions |
+
+## Vector index types (user-facing summary)
+
+Vector index names typically follow a pattern like `{clustering}_{sub_index}_{quantization}`.
+
+Common combinations:
+
+- `IVF_PQ`: IVF clustering + PQ compression
+- `IVF_HNSW_SQ`: IVF clustering + HNSW + SQ
+- `IVF_SQ`: IVF clustering + SQ
+- `IVF_RQ`: IVF clustering + RQ
+- `IVF_FLAT`: IVF clustering + no quantization (exact vectors within clusters)
+
+If you are unsure which types are supported in the user's environment, recommend starting with `IVF_PQ` and fall back to "try and see" (the API will error on unsupported types).
+
+## Vector index creation defaults
+
+Start with:
+
+- `index_type="IVF_PQ"`
+- `target_partition_size`: start with 8192 and adjust based on the dataset size and latency/recall needs
+- `num_sub_vectors`: choose a value that divides the vector dimension
+
+Practical warning (performance):
+
+- Avoid misalignment: `(dimension / num_sub_vectors) % 8 == 0` is a common sweet spot for faster index creation.
+
+## Vector search tuning defaults
+
+Tune recall vs latency with:
+
+- `nprobes`: how many IVF partitions to search
+- `refine_factor`: how many candidates to re-rank to improve accuracy
+
+When a user reports "too slow" or "bad recall", ask for:
+
+- Current `nprobes`, `refine_factor`, and index type
+- Whether the query is using `prefilter`
+
+## Scalar index selection (starting guidance)
+
+Choose scalar index type based on the filter expression:
+
+- Equality filters on high-cardinality columns: start with `BTREE`
+- Equality / IN-list filters on low-cardinality columns: start with `BITMAP`
+- List membership filters on list-like columns: start with `LABEL_LIST`
+- Substring / `contains(...)` filters on strings: start with `NGRAM`
+- Full-text search (FTS): start with `INVERTED`
+- Range filters: start with range-friendly options (for example `ZONEMAP` when appropriate)
+- Highly selective negative membership / presence checks: consider `BLOOMFILTER` (inexact)
+- Geospatial queries (if present in your build): use `RTREE`
+
+## JSON fields
+
+Lance scalar indices are created on physical columns. If you want to index a JSON sub-field:
+
+1. Materialize the extracted value into a new column (for example with `add_columns`)
+2. Create a scalar index on that new column
+
+Example (Python, using SQL expressions):
+
+```python
+ds = lance.dataset(uri)
+ds.add_columns({"country": "json_extract(payload, '$.country')"})
+ds.create_scalar_index("country", "BTREE", replace=True)
+```
+
+If you cannot confidently map the filter to an index type, recommend `BTREE` as a safe baseline and confirm via a small benchmark on representative queries.