- Carry data file format through DuckLake metadata
DuckLake’s metadata schema already has ducklake_data_file.file_format, and the spec documents it, though today it is effectively Parquet-only: DuckLake data_file spec.
But the implementation discards/hardcodes it: DuckLakeDataFile has no data-format field, and the metadata flush writes the literal "parquet" in ducklake_metadata_manager.cpp. DuckLake would need a DataFileFormat field, probably an enum or a normalized string, carried through the file select/read/list/flush paths.
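A minimal sketch of what that field could look like; the names DataFileFormat, ParseDataFileFormat, and DataFileFormatToString are assumptions for illustration, not existing DuckLake API:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical format tag backing ducklake_data_file.file_format.
enum class DataFileFormat { PARQUET, VORTEX };

// Normalize the string read back from the metadata catalog, so the
// select/read/list paths all agree on one canonical spelling.
inline DataFileFormat ParseDataFileFormat(const std::string &s) {
	if (s == "parquet") {
		return DataFileFormat::PARQUET;
	}
	if (s == "vortex") {
		return DataFileFormat::VORTEX;
	}
	throw std::invalid_argument("unknown data file format: " + s);
}

// Serialize when the metadata manager flushes ducklake_data_file rows,
// replacing today's hardcoded "parquet" literal.
inline std::string DataFileFormatToString(DataFileFormat f) {
	return f == DataFileFormat::PARQUET ? "parquet" : "vortex";
}
```

Round-tripping through a normalized representation like this keeps the catalog readable while letting the C++ paths switch on an enum.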
- Make DuckLake insert/write choose Vortex
DuckLakeInsert::GetCopyOptions currently hardcodes Parquet:
- info->format = "parquet"
- GetCopyFunction(context, "parquet")
- file_extension = "parquet"
- Parquet-specific options like field IDs and encryption config
Source: ducklake_insert.cpp.
For Vortex, DuckLake needs a table/database option such as data_file_format = 'vortex', then dispatch to DuckDB’s Vortex copy function with .vortex output files.
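The dispatch itself could be small; a sketch under the assumption that the option arrives as a normalized string (CopyTarget and ResolveCopyTarget are hypothetical names, not DuckLake code):

```cpp
#include <string>

// Hypothetical result of resolving a table's data_file_format option,
// replacing the literals hardcoded in DuckLakeInsert::GetCopyOptions.
struct CopyTarget {
	std::string copy_function;  // name handed to GetCopyFunction(context, ...)
	std::string file_extension; // suffix for the emitted data files
};

// Default to Parquet so existing tables keep their current behavior;
// only an explicit data_file_format = 'vortex' switches the writer.
inline CopyTarget ResolveCopyTarget(const std::string &data_file_format) {
	if (data_file_format == "vortex") {
		return {"vortex", "vortex"};
	}
	return {"parquet", "parquet"};
}
```

The resolved copy_function/file_extension pair would then flow into the same COPY plan DuckLake builds today, with the Parquet-only options (field IDs, encryption) applied conditionally.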
- Extend the Vortex copy function to return written file statistics
Related issue: vortex-data/vortex#7819
DuckLake plans inserts as COPY ... RETURN_STATS, and DuckDB requires the copy function to implement copy_to_get_written_statistics; otherwise the binder rejects it. The core API is in copy_function.hpp, and the binder check is in bind_copy.cpp.
The current Vortex copy registration is minimal in copy_function.cpp, and the Rust finalize step discards the WriteSummary in copy.rs. The copy function needs to expose at least row count, file size, footer size, and partition keys, and ideally column stats. Vortex already has most of the write-side data in the WriteSummary from writer.rs.
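The shape of what the callback would need to surface could look like the following; the struct and helper are illustrative, mirroring what DuckLake records per data file rather than any actual DuckDB or Vortex type:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-file result a Vortex copy_to_get_written_statistics
// implementation would hand back to DuckLake's COPY ... RETURN_STATS plan.
struct WrittenFileStats {
	std::string file_name;
	uint64_t row_count = 0;
	uint64_t file_size_bytes = 0;
	uint64_t footer_size_bytes = 0;
	std::vector<std::string> partition_keys;
};

// Finalize would build one of these per written file from the write
// summary instead of discarding it, as the current copy.rs finalize does.
inline WrittenFileStats FromSummary(std::string file, uint64_t rows,
                                    uint64_t bytes, uint64_t footer) {
	return {std::move(file), rows, bytes, footer, {}};
}
```

Column min/max stats would extend the struct the same way, since the write summary already tracks them on the Rust side.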
- Expose Vortex as a DuckDB multi-file reader, not only a table function
DuckLake scan currently clones parquet_scan and injects its own DuckLakeMultiFileReader: ducklake_scan.cpp. That lets DuckLake layer delete filters, schema mapping, row IDs, snapshot IDs, and file metadata over Parquet.
Current vortex-duckdb registers read_vortex/vortex_scan as regular table functions in lib.rs, with a plain DuckDB TableFunction wrapper in table_function.cpp. It does not expose a DuckDB MultiFileReader/BaseFileReader hook.
For full DuckLake support, Vortex should grow a DuckDB file reader path that DuckLake can plug into its multi-file reader machinery.
- Handle DuckLake row IDs, deletes, and snapshot filters
DuckLake’s multi-file reader injects delete filters and computes internal row IDs from per-file row_id_start + file_row_number: ducklake_multi_file_reader.cpp.
Vortex has file_row_number support and ScanBuilder::with_row_offset, but the current multi-file Vortex datasource does not carry DuckLake’s per-file row-id metadata; that gap would need to be bridged.
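The arithmetic the bridge has to preserve is simple; a sketch, where FileRowIdInfo and DuckLakeRowId are illustrative names and the idea is that row_id_start could be fed to ScanBuilder::with_row_offset so emitted row numbers already carry the global offset:

```cpp
#include <cstdint>

// Per-file metadata DuckLake's multi-file reader carries for row ids
// (illustrative struct; the real fields live in DuckLake's file entries).
struct FileRowIdInfo {
	uint64_t row_id_start; // first internal row id assigned to this file
	uint64_t row_count;    // rows in the file
};

// DuckLake's internal row id is row_id_start + file_row_number.
// A Vortex scan configured with with_row_offset(row_id_start) would
// produce this value directly instead of requiring a post-scan add.
inline uint64_t DuckLakeRowId(const FileRowIdInfo &file, uint64_t file_row_number) {
	return file.row_id_start + file_row_number;
}
```

Delete filters would then operate on these global ids, exactly as they do over the Parquet path today.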
- Handle schema evolution / field IDs
DuckLake binds data files using field IDs. Parquet stores field IDs in file metadata; DuckLake currently passes them as Parquet copy options. Vortex files would need an equivalent story: either store DuckLake field IDs in Vortex file metadata or add a DuckLake-side mapping layer that can project Vortex columns by stable field identity.
This also means Vortex multi-file scanning needs to relax its current exact-dtype expectation. The existing TODOs in vortex-layout/src/scan/multi.rs already call out schema union, virtual columns, and per-file stats.
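A DuckLake-side mapping layer of the kind described above could be sketched like this, assuming field ids are stored per file; the FieldIdMapping class is hypothetical:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical per-file mapping from DuckLake's stable field ids to the
// column names in one concrete Vortex file, so projection survives renames.
class FieldIdMapping {
public:
	void Register(uint64_t field_id, std::string column_name) {
		by_id_[field_id] = std::move(column_name);
	}

	// Empty result means the field was added after this file was written;
	// the scan must then materialize the column as NULLs (schema evolution).
	std::optional<std::string> Resolve(uint64_t field_id) const {
		auto it = by_id_.find(field_id);
		if (it == by_id_.end()) {
			return std::nullopt;
		}
		return it->second;
	}

private:
	std::unordered_map<uint64_t, std::string> by_id_;
};
```

Whether this table is built from field ids embedded in Vortex file metadata or from DuckLake's catalog is exactly the design choice the section above leaves open.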
- Add Vortex metadata extraction for ducklake_add_data_files
DuckLake’s external add-files path is Parquet-specific: it calls parquet_full_metadata, parses Parquet schema/stats, then constructs DuckLakeDataFile. Source: ducklake_add_data_files.cpp.
To add existing .vortex files, DuckLake needs a vortex_full_metadata equivalent or a reusable C++ API that returns row count, file size, footer size, schema/field mapping, and column stats.
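A possible return shape for such an API, mirroring what the Parquet path extracts before constructing a DuckLakeDataFile; all names here are illustrative, not an existing Vortex or DuckLake struct:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result of a vortex_full_metadata-style call over one file.
struct VortexFileMetadata {
	uint64_t row_count = 0;
	uint64_t file_size_bytes = 0;
	uint64_t footer_size_bytes = 0;

	struct ColumnInfo {
		uint64_t field_id;     // stable identity for schema mapping
		std::string name;      // column name as written in the file
		std::string min_value; // serialized bounds for file pruning
		std::string max_value;
		uint64_t null_count;
	};
	std::vector<ColumnInfo> columns;
};
```

With this in hand, the add-files path could construct a DuckLakeDataFile for a .vortex file the same way it does from parquet_full_metadata output today.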
Likely MVP
The narrowest useful MVP would be Vortex-only DuckLake tables with append and scan:
- a data_file_format = 'vortex' table option
- Vortex COPY TO returning DuckDB written-file statistics
- inserts writing .vortex files and storing file_format = 'vortex' in the metadata
For full DuckLake semantics, the bigger pieces are the Vortex DuckDB multi-file reader integration and format-neutralizing the Parquet assumptions inside DuckLake’s write/add-files/metadata paths.