Epic: Vortex DuckLake - maybe?

**Main Changes**

1. **Carry data file format through DuckLake metadata**

   DuckLake’s metadata schema already has a `ducklake_data_file.file_format` column, and the spec documents it, although in practice it is Parquet-only: [DuckLake data_file spec](https://ducklake.select/docs/stable/specification/tables/ducklake_data_file.html).

   The implementation, however, drops the value in memory and hardcodes it on write: `DuckLakeDataFile` has no data-format field, and metadata flush writes the literal `"parquet"` in [ducklake_metadata_manager.cpp](https://github.com/duckdb/ducklake/blob/main/src/storage/ducklake_metadata_manager.cpp). DuckLake would need a `DataFileFormat` field, probably an enum or a normalized string, carried through the file select/read/list/flush paths.
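
   A minimal sketch of what such a field could look like, assuming an enum with string round-tripping. `DataFileFormat` is the name proposed above; `ParseDataFileFormat` and `DataFileFormatToString` are hypothetical helpers and do not exist in DuckLake today.

   ```cpp
   #include <algorithm>
   #include <cctype>
   #include <optional>
   #include <string>

   // Hypothetical enum: DuckLake does not define this type today.
   enum class DataFileFormat { PARQUET, VORTEX };

   // Normalize the string read from ducklake_data_file.file_format.
   // Returns std::nullopt for formats this sketch does not know about.
   inline std::optional<DataFileFormat> ParseDataFileFormat(std::string value) {
       std::transform(value.begin(), value.end(), value.begin(),
                      [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
       if (value == "parquet") {
           return DataFileFormat::PARQUET;
       }
       if (value == "vortex") {
           return DataFileFormat::VORTEX;
       }
       return std::nullopt;
   }

   // String written back at metadata flush instead of a literal "parquet".
   inline const char *DataFileFormatToString(DataFileFormat format) {
       switch (format) {
       case DataFileFormat::PARQUET:
           return "parquet";
       case DataFileFormat::VORTEX:
           return "vortex";
       }
       return "unknown";
   }
   ```

   An enum keeps the dispatch sites exhaustive, while the normalized string remains the on-disk representation in `file_format`.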

2. **Make DuckLake insert/write choose Vortex**

   `DuckLakeInsert::GetCopyOptions` currently hardcodes Parquet:

   - `info->format = "parquet"`
   - `GetCopyFunction(context, "parquet")`
   - `file_extension = "parquet"`
   - Parquet-specific options like field IDs and encryption config

   Source: [ducklake_insert.cpp](https://github.com/duckdb/ducklake/blob/main/src/storage/ducklake_insert.cpp).

   For Vortex, DuckLake needs a table/database option such as `data_file_format = 'vortex'`, then dispatch to DuckDB’s Vortex copy function with `.vortex` output files.
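
   The dispatch could be sketched roughly as below. `CopyTarget` and `ResolveCopyTarget` are hypothetical names for illustration, and the copy function name `"vortex"` is an assumption about how the Vortex extension registers its copy function, not something confirmed here.

   ```cpp
   #include <stdexcept>
   #include <string>

   // Hypothetical holder for the values GetCopyOptions hardcodes today.
   struct CopyTarget {
       std::string format;         // value that would go into info->format
       std::string copy_function;  // name that would be passed to GetCopyFunction
       std::string file_extension; // extension for newly written data files
       bool uses_parquet_options;  // field IDs, encryption config, etc.
   };

   // Map the (proposed) data_file_format option to copy settings.
   inline CopyTarget ResolveCopyTarget(const std::string &data_file_format) {
       if (data_file_format == "parquet") {
           return {"parquet", "parquet", "parquet", true};
       }
       if (data_file_format == "vortex") {
           // Assumption: the Vortex copy function is registered as "vortex".
           return {"vortex", "vortex", "vortex", false};
       }
       throw std::invalid_argument("unsupported data_file_format: " + data_file_format);
   }
   ```

   Keeping the Parquet-specific options behind a flag makes it explicit which parts of `GetCopyOptions` still need a Vortex equivalent.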

3. **Extend the Vortex copy function to return written file statistics**

   Related issue: [vortex-data/vortex#7819](https://github.com/vortex-data/vortex/issues/7819)

   DuckLake plans inserts as `COPY ... RETURN_STATS`. DuckDB requires the copy function to implement `copy_to_get_written_statistics`; otherwise binding rejects it. The core API is in [copy_function.hpp](https://github.com/duckdb/duckdb/blob/v1.5.2/src/include/duckdb/function/copy_function.hpp#L163), and the binder check is in [bind_copy.cpp](https://github.com/duckdb/duckdb/blob/v1.5.2/src/planner/binder/statement/bind_copy.cpp#L303).

   Current Vortex copy registration is minimal in [copy_function.cpp](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-duckdb/cpp/copy_function.cpp#L141), and Rust finalize discards the `WriteSummary` in [copy.rs](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-duckdb/src/copy.rs#L102). This needs to expose at least row count, file size, footer size, partition keys, and ideally column stats. Vortex already has most of the write-side data in `WriteSummary` from [writer.rs](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-file/src/writer.rs#L462).
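
   As a rough illustration of the data that has to cross the Rust/C++ boundary, the sketch below models the fields listed above as a plain struct. `WrittenFileStats`, `ColumnStats`, and `IsConsistent` are hypothetical; they mirror neither DuckDB's statistics type nor Vortex's `WriteSummary`.

   ```cpp
   #include <cstdint>
   #include <map>
   #include <string>
   #include <vector>

   // Hypothetical per-column statistics, kept as strings for simplicity.
   struct ColumnStats {
       std::string min;
       std::string max;
       uint64_t null_count = 0;
   };

   // Hypothetical shape of what a Vortex COPY would need to report per file.
   struct WrittenFileStats {
       std::string path;
       uint64_t row_count = 0;
       uint64_t file_size_bytes = 0;
       uint64_t footer_size_bytes = 0;
       std::vector<std::string> partition_keys;
       std::map<std::string, ColumnStats> column_stats; // optional, may be empty
   };

   // Basic sanity check before handing the stats to DuckLake:
   // the file must have a path and a size, and the footer must fit in the file.
   inline bool IsConsistent(const WrittenFileStats &stats) {
       return !stats.path.empty() && stats.file_size_bytes > 0 &&
              stats.footer_size_bytes <= stats.file_size_bytes;
   }
   ```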

4. **Expose Vortex as a DuckDB multi-file reader, not only a table function**

   DuckLake scan currently clones `parquet_scan` and injects its own `DuckLakeMultiFileReader`: [ducklake_scan.cpp](https://github.com/duckdb/ducklake/blob/main/src/storage/ducklake_scan.cpp). That lets DuckLake layer delete filters, schema mapping, row IDs, snapshot IDs, and file metadata over Parquet.

   Current `vortex-duckdb` registers `read_vortex`/`vortex_scan` as regular table functions in [lib.rs](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-duckdb/src/lib.rs#L51), with a plain DuckDB `TableFunction` wrapper in [table_function.cpp](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-duckdb/cpp/table_function.cpp#L384). It does not expose a DuckDB `MultiFileReader`/`BaseFileReader` hook.

   For full DuckLake support, Vortex should grow a DuckDB file reader path that DuckLake can plug into its multi-file reader machinery.

5. **Handle DuckLake row IDs, deletes, and snapshot filters**

   DuckLake’s multi-file reader injects delete filters and computes internal row IDs from per-file `row_id_start + file_row_number`: [ducklake_multi_file_reader.cpp](https://github.com/duckdb/ducklake/blob/main/src/storage/ducklake_multi_file_reader.cpp).

   Vortex has `file_row_number` support and `ScanBuilder::with_row_offset`, but the current multi-file Vortex datasource does not carry DuckLake’s per-file row-id metadata. That gap would need to be bridged.
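
   The row-ID arithmetic itself is simple; the sketch below shows it under the assumption that deletes are expressed as positions within a file. `FileRowIdInfo` and `LiveRowIds` are hypothetical names, not DuckLake or Vortex APIs.

   ```cpp
   #include <cstdint>
   #include <unordered_set>
   #include <vector>

   // Hypothetical per-file metadata a Vortex reader would need from DuckLake.
   struct FileRowIdInfo {
       int64_t row_id_start;                         // first row ID assigned to this file
       std::unordered_set<int64_t> deleted_positions; // assumed: file-relative deleted rows
   };

   // Compute the row IDs of all rows in a file that survive the delete filter.
   inline std::vector<int64_t> LiveRowIds(const FileRowIdInfo &file, int64_t row_count) {
       std::vector<int64_t> row_ids;
       for (int64_t file_row_number = 0; file_row_number < row_count; ++file_row_number) {
           if (file.deleted_positions.count(file_row_number) > 0) {
               continue; // row removed by a delete file
           }
           row_ids.push_back(file.row_id_start + file_row_number);
       }
       return row_ids;
   }
   ```

   In a real reader the offset would more likely be applied lazily per batch rather than materialized into a vector.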

6. **Handle schema evolution / field IDs**

   DuckLake binds data files using field IDs. Parquet stores field IDs in file metadata; DuckLake currently passes them as Parquet copy options. Vortex files would need an equivalent story: either store DuckLake field IDs in Vortex file metadata or add a DuckLake-side mapping layer that can project Vortex columns by stable field identity.

   This also means Vortex multi-file scanning needs to relax its current exact-dtype expectation. The existing TODOs in [vortex-layout/src/scan/multi.rs](https://github.com/vortex-data/vortex/blob/a7c4c76aec5d500116f8b6285f691932416f63e0/vortex-layout/src/scan/multi.rs#L12) already call out schema union, virtual columns, and per-file stats.
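
   A mapping layer keyed on field IDs could look roughly like the sketch below. It assumes each Vortex file can somehow report a field ID per column, which is exactly the open question above; `TableColumn`, `FileColumn`, and `MapByFieldId` are hypothetical.

   ```cpp
   #include <cstddef>
   #include <cstdint>
   #include <optional>
   #include <string>
   #include <unordered_map>
   #include <vector>

   // Hypothetical descriptors: a column as the table sees it and as a file stores it.
   struct TableColumn {
       int64_t field_id;
       std::string name;
   };
   struct FileColumn {
       int64_t field_id;
       std::string name;
   };

   // For each table column, return the index of the matching file column,
   // or std::nullopt if the file predates the column (to be filled with a default).
   inline std::vector<std::optional<std::size_t>>
   MapByFieldId(const std::vector<TableColumn> &table, const std::vector<FileColumn> &file) {
       std::unordered_map<int64_t, std::size_t> index_by_id;
       for (std::size_t i = 0; i < file.size(); ++i) {
           index_by_id[file[i].field_id] = i;
       }
       std::vector<std::optional<std::size_t>> mapping;
       mapping.reserve(table.size());
       for (const auto &column : table) {
           auto it = index_by_id.find(column.field_id);
           if (it == index_by_id.end()) {
               mapping.push_back(std::nullopt);
           } else {
               mapping.push_back(it->second);
           }
       }
       return mapping;
   }
   ```

   Matching on field ID rather than name means a renamed or reordered column still resolves to the right file column.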

7. **Add Vortex metadata extraction for `ducklake_add_data_files`**

   DuckLake’s external add-files path is Parquet-specific: it calls `parquet_full_metadata`, parses Parquet schema/stats, then constructs `DuckLakeDataFile`. Source: [ducklake_add_data_files.cpp](https://github.com/duckdb/ducklake/blob/main/src/functions/ducklake_add_data_files.cpp).

   To add existing `.vortex` files, DuckLake needs a `vortex_full_metadata` equivalent or a reusable C++ API that returns row count, file size, footer size, schema/field mapping, and column stats.

**Likely MVP**

The narrowest useful MVP would be Vortex-only DuckLake tables with append and scan:

- DuckLake option: `data_file_format = 'vortex'`
- Vortex `COPY TO` returns DuckDB written-file statistics
- DuckLake writes `.vortex` and stores `file_format = 'vortex'`
- DuckLake scan dispatches to a Vortex multi-file reader
- Initially reject mixed Parquet/Vortex tables, add-existing-files, encryption, and possibly deletes/schema evolution until those paths are implemented

For full DuckLake semantics, the bigger pieces are integrating Vortex with DuckDB’s multi-file reader and making the Parquet assumptions inside DuckLake’s write, add-files, and metadata paths format-neutral.

Epic: Vortex DuckLake - maybe? #7820
