feat: add lance_dataset_alter_columns for rename / nullability / type changes #44

Merged
jja725 merged 1 commit into lance-format:main from LuciferYang:feat/dataset-alter-columns on Jun 8, 2026

@LuciferYang

Contributor

Summary

Second of three PRs against #41. Exposes upstream's Dataset::alter_columns — rename a column, change its nullability, or change its data type, committing a new manifest. Rename and nullability-only changes are zero-copy and preserve indices on the affected column; a type change rewrites the column's data files and drops any associated indices, mirroring upstream behavior.

Mutates the dataset in place under an exclusive write lock; scanners already in flight against it keep their pre-alteration view via the existing Arc clone-on-write, same as _delete (#31) / _drop_columns (#42).

Surface

```c
typedef enum {
    LANCE_COLUMN_NULLABLE_UNCHANGED = 0,
    LANCE_COLUMN_NULLABLE_TRUE      = 1,
    LANCE_COLUMN_NULLABLE_FALSE     = 2,
} LanceColumnNullableMode;

typedef struct LanceColumnAlteration {
    const char* path;                       /* required */
    const char* rename;                     /* NULL = keep */
    int32_t     nullable_mode;              /* LanceColumnNullableMode discriminant */
    const struct ArrowSchema* data_type;    /* NULL = keep */
} LanceColumnAlteration;

int32_t lance_dataset_alter_columns(
    LanceDataset* dataset,
    const LanceColumnAlteration* alterations,
    size_t num_alterations
);
```

Per-alteration validation runs up front with index-tagged error messages. The struct uses sentinels for the three optional fields (rename = NULL, nullable_mode = UNCHANGED, data_type = NULL); at least one must request a change, so a zero-init struct with only path set is rejected as a no-op rather than silently consuming a manifest version.
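The per-alteration rules above can be sketched as a caller-side pre-check. This is only an illustration, not the wrapper's actual code: `first_rejected_alteration` is a hypothetical helper, the order of the checks is an assumption, and the NULL-array and zero-count guards are left out. The enum and struct are copied from the surface above.

```c
#include <stddef.h>
#include <stdint.h>

struct ArrowSchema; /* opaque here; laid out by the Arrow C Data Interface */

typedef enum {
    LANCE_COLUMN_NULLABLE_UNCHANGED = 0,
    LANCE_COLUMN_NULLABLE_TRUE      = 1,
    LANCE_COLUMN_NULLABLE_FALSE     = 2,
} LanceColumnNullableMode;

typedef struct LanceColumnAlteration {
    const char* path;                       /* required */
    const char* rename;                     /* NULL = keep */
    int32_t     nullable_mode;              /* LanceColumnNullableMode discriminant */
    const struct ArrowSchema* data_type;    /* NULL = keep */
} LanceColumnAlteration;

/* Hypothetical helper (not part of the library): returns the index of the
 * first alteration the rules above would reject, or -1 if every entry
 * passes these structural checks. */
long first_rejected_alteration(const LanceColumnAlteration* alts, size_t n) {
    for (size_t i = 0; i < n; i++) {
        const LanceColumnAlteration* a = &alts[i];

        /* path is required and must not be empty */
        if (a->path == NULL || a->path[0] == '\0') return (long)i;

        /* a rename, when given, must not be the empty string */
        if (a->rename != NULL && a->rename[0] == '\0') return (long)i;

        /* nullable_mode must be one of the three known discriminants */
        if (a->nullable_mode < LANCE_COLUMN_NULLABLE_UNCHANGED ||
            a->nullable_mode > LANCE_COLUMN_NULLABLE_FALSE) return (long)i;

        /* all three sentinels set means nothing was requested: a no-op */
        if (a->rename == NULL &&
            a->nullable_mode == LANCE_COLUMN_NULLABLE_UNCHANGED &&
            a->data_type == NULL) return (long)i;
    }
    return -1;
}
```

Returning the offending index mirrors the index-tagged error messages: a caller can point at the exact array entry without a round trip through the FFI.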

Two design choices worth calling out:

  • nullable_mode is int32_t, not the enum directly. The struct is read across the FFI boundary, and Rust treats a #[repr(C)] enum read from C with an out-of-range discriminant as UB. So the field is int32_t and a LanceColumnNullableMode::from_raw(i32) helper converts and returns INVALID_ARGUMENT for unknown values — same pattern as merge_insert's WhenMatched::from_raw.

  • data_type borrows an ArrowSchema; the wrapper never calls its release callback. Before handing the pointer to arrow-rs, the wrapper checks both release == NULL (the Arrow C Data Interface "released" sentinel) and format == NULL (which catches FFI_ArrowSchema::empty() and other half-built structs that would otherwise hit an assert! in DataType::try_from and abort the host process under panic = "abort").
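Both guards can be mirrored on the C side before a struct is ever passed in. The sketch below is an assumption-laden illustration: `nullable_mode_is_known`, `schema_is_initialised` and `demo_release` are hypothetical names, and the real checks live in the Rust wrapper. The `ArrowSchema` layout is the one published in the Arrow C Data Interface specification.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
/* Layout as published in the Arrow C Data Interface specification. */
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};
#endif

/* Hypothetical: 1 if raw is a known LanceColumnNullableMode discriminant
 * (0 = UNCHANGED, 1 = TRUE, 2 = FALSE), 0 otherwise. Working on a plain
 * int32_t means an out-of-range value is just a number to reject. */
int nullable_mode_is_known(int32_t raw) {
    return raw >= 0 && raw <= 2;
}

/* Hypothetical: 1 if the schema is neither released (release == NULL)
 * nor half-built (format == NULL), 0 otherwise. */
int schema_is_initialised(const struct ArrowSchema* s) {
    return s != NULL && s->release != NULL && s->format != NULL;
}

/* Demo-only release callback: owns nothing, so it only marks the struct
 * as released by clearing release, as the specification requires. */
void demo_release(struct ArrowSchema* s) {
    s->release = NULL;
}
```

Since the wrapper only borrows `data_type`, the caller remains responsible for invoking `release` once the call returns.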

The C++ wrapper takes const std::vector<lance::ColumnAlteration>& and uses the same direct-pass convention as update/merge_insert siblings — raw.data() unconditionally; an empty vector flows through the Rust-side num_alterations == 0 guard so the error message is precise.

Tests

Nineteen new Rust integration tests cover the positive paths (rename, relax / tighten nullability, Int32→Int64 upcast with value round-trip, combined rename+relax, multi-alteration per call, version bump) and the full rejection surface (NULL dataset / NULL array / zero count / NULL path / empty path / empty rename / unknown column / incompatible cast / no-op alteration with schema-unchanged assertion / invalid nullable_mode discriminant / uninitialised FFI_ArrowSchema / tightening nullability when existing rows hold NULLs).

C and C++ smoke tests slot in before test_drop_columns, relax id to nullable, and verify via ArrowSchema.flags & ARROW_FLAG_NULLABLE. Both also exercise the NULL / zero / no-op / bad-discriminant negative paths. cargo test and cargo test --test compile_and_run_test -- --ignored both green.
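The nullability assertion in the smoke tests amounts to a flag test on the exported field. A minimal sketch, assuming the dataset schema has already been exported into an `ArrowSchema` whose children are the top-level columns; `child_is_nullable` is a hypothetical helper, and how the schema is exported is outside this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
/* Layout and flag value as published in the Arrow C Data Interface spec. */
#define ARROW_FLAG_NULLABLE 2
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);
    void* private_data;
};
#endif

/* Hypothetical helper: looks up a top-level column by name.
 * Returns 1 if it is nullable, 0 if it is not, -1 if it is not found. */
int child_is_nullable(const struct ArrowSchema* schema, const char* name) {
    if (schema == NULL || name == NULL) return -1;
    for (int64_t i = 0; i < schema->n_children; i++) {
        const struct ArrowSchema* child = schema->children[i];
        if (child != NULL && child->name != NULL &&
            strcmp(child->name, name) == 0) {
            return (child->flags & ARROW_FLAG_NULLABLE) != 0;
        }
    }
    return -1;
}
```

After relaxing `id`, a check in this style would be expected to return 1 for `"id"`.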

Follow-up

  • lance_dataset_add_columns — SQL expressions / AllNulls / ArrowArrayStream

The README roadmap entry stays unticked until that lands.

… changes

Second of three PRs covering the schema-evolution roadmap entry. Exposes
upstream's `Dataset::alter_columns` — rename, change nullability, or
change data type of one or more columns in a single manifest commit.
Rename and nullability-only changes are zero-copy and preserve indices;
a type change rewrites the column's data files and drops any indices
that referenced it.
@LuciferYang

Contributor Author

cc @jja725

@jja725 jja725 merged commit 68da0b3 into lance-format:main Jun 8, 2026
9 checks passed
@jja725

jja725 commented Jun 8, 2026

Collaborator

I've been on vacation; thanks for the contribution, @LuciferYang

@LuciferYang

Contributor Author

Thank you @jja725

jja725 pushed a commit that referenced this pull request Jun 11, 2026
…n addition (#45)

## Summary

Last of three PRs against #41. Exposes upstream's `Dataset::add_columns`
through the three `NewColumnTransform` cases that translate cleanly
across the C ABI:

- **SQL expressions** — derive new columns from SQL over existing
columns.
- **All-null columns** — add nullable columns from an Arrow schema. On
the modern format this is metadata-only; the legacy format can't
represent it that way and returns `LANCE_ERR_NOT_SUPPORTED`.
- **Stream** — splice in precomputed column data from an Arrow C stream,
aligned positionally to the dataset's existing rows.

Upstream's fourth variant, `BatchUDF`, is left out on purpose: it
carries a Rust closure that can't cross the C ABI, and the stream
variant already covers the same "bring your own computed data" use case.

Each call mutates the dataset in place under an exclusive write lock;
scanners already in flight keep their pre-add view via the same Arc
clone-on-write as the `_drop_columns` (#42) / `_alter_columns` (#44)
siblings.

Three focused entry points rather than one mode-tagged function, because
the inputs are genuinely different shapes (name/expression pairs vs. a
schema pointer vs. a stream). Upstream's `read_columns` parameter isn't
exposed — it only feeds `BatchUDF`; for these three variants upstream
ignores it. `batch_size` is forwarded where it does something (SQL scan,
stream alignment) and omitted from the metadata-only all-null path.

## Surface

```c
typedef struct LanceSqlColumn {
    const char* name;
    const char* expression;
} LanceSqlColumn;

int32_t lance_dataset_add_columns_sql(
    LanceDataset* dataset, const LanceSqlColumn* columns,
    size_t num_columns, uint64_t batch_size);

int32_t lance_dataset_add_columns_nulls(
    LanceDataset* dataset, const struct ArrowSchema* schema);

int32_t lance_dataset_add_columns_stream(
    LanceDataset* dataset, struct ArrowArrayStream* stream,
    uint64_t batch_size);
```

`batch_size` uses `0` for the upstream default and is range-checked to
`u32`. Two error-code details are worth calling out, both matching
existing behavior: a SQL expression referencing a non-existent column
surfaces as `LANCE_ERR_INTERNAL` (an upstream schema error, the same
path as `lance_dataset_delete`), whereas a syntax error is
`LANCE_ERR_INVALID_ARGUMENT`.
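
The `batch_size` rule can be illustrated in a few lines of C. This is a sketch only: `batch_size_to_u32` is a hypothetical name, and passing `0` through unchanged as the "use the upstream default" marker is an assumption about how the value is represented internally.

```c
#include <stdint.h>

/* Hypothetical helper: narrows a caller-supplied batch size to 32 bits.
 * Returns 0 on success and -1 when the value does not fit in a u32.
 * A value of 0 is passed through as the "upstream default" marker. */
int batch_size_to_u32(uint64_t batch_size, uint32_t* out) {
    if (batch_size > UINT32_MAX) {
        return -1; /* out of range: the caller should reject the argument */
    }
    *out = (uint32_t)batch_size;
    return 0;
}
```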

Because these are `unsafe extern "C"` entry points under `panic =
"abort"`, the stream variant pre-validates the mandatory Arrow C stream callbacks
before handing the stream to arrow-rs (which would otherwise abort on a
NULL `get_schema` / `get_next`), and the all-null variant rejects an
uninitialised or non-UTF-8 top-level schema `format` before arrow-rs's
`assert!`/`expect` can fire. The stream is consumed (released) on every
non-NULL return path.
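
The stream pre-validation might look like the following on the C side. This is a sketch under stated assumptions: `stream_is_callable` and the `demo_*` stubs are hypothetical, the real check is in the Rust wrapper, and whether it also inspects `get_last_error` is not stated here, so this version tests only `release`, `get_schema` and `get_next`. The struct layout is the one published in the Arrow C Stream Interface specification.

```c
#include <stddef.h>

struct ArrowSchema; /* opaque in this sketch */
struct ArrowArray;  /* opaque in this sketch */

#ifndef ARROW_C_STREAM_INTERFACE
#define ARROW_C_STREAM_INTERFACE
/* Layout as published in the Arrow C Stream Interface specification. */
struct ArrowArrayStream {
    int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
    int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
    const char* (*get_last_error)(struct ArrowArrayStream*);
    void (*release)(struct ArrowArrayStream*);
    void* private_data;
};
#endif

/* Hypothetical helper: 1 if the stream is live (release != NULL) and the
 * two callbacks named above are present, 0 otherwise. */
int stream_is_callable(const struct ArrowArrayStream* s) {
    return s != NULL && s->release != NULL &&
           s->get_schema != NULL && s->get_next != NULL;
}

/* Demo-only stubs: they produce no data and report a non-zero error code. */
int demo_get_schema(struct ArrowArrayStream* s, struct ArrowSchema* out) {
    (void)s; (void)out;
    return 1;
}
int demo_get_next(struct ArrowArrayStream* s, struct ArrowArray* out) {
    (void)s; (void)out;
    return 1;
}
/* Marks the stream as released by clearing release, per the specification. */
void demo_stream_release(struct ArrowArrayStream* s) {
    s->release = NULL;
}
```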

## Tests

Rust integration tests cover all three variants end to end: computed
values (single and multi-column SQL, constant expressions, honored
`batch_size`), all-null backfill, and multi-fragment stream alignment —
plus the full rejection surface (NULL/empty/non-UTF-8 inputs, name
collisions, row-count mismatch, invalid `batch_size`,
released/missing-callback streams, non-nullable all-null fields, and the
legacy-format `NOT_SUPPORTED` path). Stream-consumption is proven with a
drop counter rather than a vacuous release-slot check. C and C++ smoke
tests exercise the SQL happy path and each variant's argument rejections
across the ABI.
jja725 pushed a commit that referenced this pull request Jun 16, 2026
The schema-evolution series is fully merged, so this flips the Phase 3
roadmap row from `[ ]` to `[x]` and names the functions that cover it,
matching the style of the rows around it.

Covered by:
- #45 — `lance_dataset_add_columns_sql/_nulls/_stream`
- #44 — `lance_dataset_alter_columns`
- #42 — `lance_dataset_drop_columns`

Closes #41.

Docs-only; no code or test changes.