Epic: Vortex DuckLake - maybe? #7820

@gatesn

Main Changes

  1. Carry data file format through DuckLake metadata

    DuckLake’s metadata schema already has a ducklake_data_file.file_format column, and the spec documents it, although in practice it is currently Parquet-only: DuckLake data_file spec.

    The implementation, however, discards or hardcodes it. DuckLakeDataFile has no data-format field, and the metadata flush writes the literal "parquet" in ducklake_metadata_manager.cpp. DuckLake would need a DataFileFormat field, probably an enum or a normalized string, carried through the file select/read/list/flush paths.
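
    As a rough illustration of the enum option, the sketch below normalizes the stored file_format string into a typed value and back. The names DataFileFormat, ParseDataFileFormat, and DataFileFormatToString are hypothetical and do not exist in DuckLake today; the case-insensitive parsing is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical enum; DuckLake does not define this today.
enum class DataFileFormat { PARQUET, VORTEX };

// Normalize the string stored in ducklake_data_file.file_format.
// Returns std::nullopt for unknown values so callers can fail loudly.
inline std::optional<DataFileFormat> ParseDataFileFormat(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (value == "parquet") {
        return DataFileFormat::PARQUET;
    }
    if (value == "vortex") {
        return DataFileFormat::VORTEX;
    }
    return std::nullopt;
}

// Inverse of ParseDataFileFormat: the string the metadata flush would write
// instead of the literal "parquet".
inline const char *DataFileFormatToString(DataFileFormat format) {
    switch (format) {
    case DataFileFormat::PARQUET:
        return "parquet";
    case DataFileFormat::VORTEX:
        return "vortex";
    }
    assert(false && "unhandled DataFileFormat");
    return "unknown";
}
```

    Returning an empty optional for unrecognized strings keeps the decision about how to treat unknown formats with the caller.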

  2. Make DuckLake insert/write choose Vortex

    DuckLakeInsert::GetCopyOptions currently hardcodes Parquet:

    • info->format = "parquet"
    • GetCopyFunction(context, "parquet")
    • file_extension = "parquet"
    • Parquet-specific options like field IDs and encryption config

    Source: ducklake_insert.cpp.

    For Vortex, DuckLake needs a table/database option such as data_file_format = 'vortex', then dispatch to DuckDB’s Vortex copy function with .vortex output files.
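
    One possible shape for that dispatch is sketched below. CopyTarget and SelectCopyTarget are hypothetical stand-ins, not DuckLake code; the copy function name "vortex" and the choice to mark field IDs and encryption as unsupported for Vortex are assumptions that reflect the Parquet-specific options listed above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the values GetCopyOptions currently hardcodes.
struct CopyTarget {
    std::string copy_function;  // name used for the copy-function lookup
    std::string file_extension; // extension of the written data files
    bool supports_field_ids;    // Parquet-specific today
    bool supports_encryption;   // Parquet-specific today
};

// Pick the copy target from the (hypothetical) data_file_format option.
inline CopyTarget SelectCopyTarget(const std::string &data_file_format) {
    if (data_file_format == "parquet") {
        return {"parquet", "parquet", true, true};
    }
    if (data_file_format == "vortex") {
        // Assumption: field IDs and encryption are not wired up for Vortex yet.
        return {"vortex", "vortex", false, false};
    }
    throw std::invalid_argument("unsupported data_file_format: " + data_file_format);
}
```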

  3. Extend the Vortex copy function to return written file statistics

    Related issue: vortex-data/vortex#7819

    DuckLake plans inserts as COPY ... RETURN_STATS. DuckDB requires the copy function to implement copy_to_get_written_statistics; otherwise binding rejects it. The core API is in copy_function.hpp, and the binder check is in bind_copy.cpp.

    The current Vortex copy registration in copy_function.cpp is minimal, and the Rust finalize step in copy.rs discards the WriteSummary. The copy function needs to expose at least row count, file size, footer size, partition keys, and ideally column stats. Vortex already has most of the write-side data in the WriteSummary from writer.rs.

  4. Expose Vortex as a DuckDB multi-file reader, not only a table function

    DuckLake scan currently clones parquet_scan and injects its own DuckLakeMultiFileReader: ducklake_scan.cpp. That lets DuckLake layer delete filters, schema mapping, row IDs, snapshot IDs, and file metadata over Parquet.

    Current vortex-duckdb registers read_vortex/vortex_scan as regular table functions in lib.rs, with a plain DuckDB TableFunction wrapper in table_function.cpp. It does not expose a DuckDB MultiFileReader/BaseFileReader hook.

    For full DuckLake support, Vortex should grow a DuckDB file reader path that DuckLake can plug into its multi-file reader machinery.

  5. Handle DuckLake row IDs, deletes, and snapshot filters

    DuckLake’s multi-file reader injects delete filters and computes internal row IDs from per-file row_id_start + file_row_number: ducklake_multi_file_reader.cpp.

    Vortex has file_row_number support and ScanBuilder::with_row_offset, but the current multi-file Vortex datasource is not carrying DuckLake’s per-file row-id metadata. That would need to be bridged.
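
    The arithmetic being bridged is small. The sketch below is a minimal model of it, assuming a per-file row_id_start and a set of deleted file positions; FileRowIdInfo and VisibleRowIds are hypothetical names, and a real reader would apply the delete filter during the scan rather than materializing every row number.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical per-file metadata DuckLake would hand to a Vortex reader.
struct FileRowIdInfo {
    int64_t row_id_start;                          // first row ID assigned to this file
    std::unordered_set<int64_t> deleted_positions; // deleted file_row_number values
};

// Returns the DuckLake row IDs of the rows that survive the delete filter,
// computed as row_id_start + file_row_number.
inline std::vector<int64_t> VisibleRowIds(const FileRowIdInfo &info, int64_t file_row_count) {
    assert(file_row_count >= 0);
    std::vector<int64_t> row_ids;
    for (int64_t file_row_number = 0; file_row_number < file_row_count; file_row_number++) {
        if (info.deleted_positions.count(file_row_number) > 0) {
            continue; // row was deleted in this snapshot
        }
        row_ids.push_back(info.row_id_start + file_row_number);
    }
    return row_ids;
}
```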

  6. Handle schema evolution / field IDs

    DuckLake binds data files using field IDs. Parquet stores field IDs in file metadata; DuckLake currently passes them as Parquet copy options. Vortex files would need an equivalent story: either store DuckLake field IDs in Vortex file metadata or add a DuckLake-side mapping layer that can project Vortex columns by stable field identity.

    This also means Vortex multi-file scanning needs to relax its current exact-dtype expectation. The existing TODOs in vortex-layout/src/scan/multi.rs already call out schema union, virtual columns, and per-file stats.
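
    For the DuckLake-side mapping option, a minimal sketch of projecting by field ID follows. FileColumn and MapByFieldId are hypothetical; the sketch assumes each file can report one field ID per stored column, and it treats a missing field ID as a column the file predates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical: one entry per column physically stored in a data file.
struct FileColumn {
    int64_t field_id;
};

// For each table column (identified by field ID), returns the index of the
// file column to read, or std::nullopt when the file has no such column.
inline std::vector<std::optional<size_t>> MapByFieldId(const std::vector<int64_t> &table_field_ids,
                                                      const std::vector<FileColumn> &file_columns) {
    std::unordered_map<int64_t, size_t> index_by_id;
    for (size_t i = 0; i < file_columns.size(); i++) {
        index_by_id.emplace(file_columns[i].field_id, i);
    }
    std::vector<std::optional<size_t>> projection;
    projection.reserve(table_field_ids.size());
    for (int64_t id : table_field_ids) {
        auto it = index_by_id.find(id);
        if (it == index_by_id.end()) {
            projection.push_back(std::nullopt); // column added after this file was written
        } else {
            projection.push_back(it->second);
        }
    }
    return projection;
}
```

    Because the lookup is keyed on field ID rather than name or position, renamed or reordered columns resolve to the same stored data.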

  7. Add Vortex metadata extraction for ducklake_add_data_files

    DuckLake’s external add-files path is Parquet-specific: it calls parquet_full_metadata, parses Parquet schema/stats, then constructs DuckLakeDataFile. Source: ducklake_add_data_files.cpp.

    To add existing .vortex files, DuckLake needs a vortex_full_metadata equivalent or a reusable C++ API that returns row count, file size, footer size, schema/field mapping, and column stats.

Likely MVP

The narrowest useful MVP would be Vortex-only DuckLake tables with append and scan:

  • DuckLake option: data_file_format = 'vortex'
  • Vortex COPY TO returns DuckDB written-file statistics
  • DuckLake writes .vortex and stores file_format = 'vortex'
  • DuckLake scan dispatches to a Vortex multi-file reader
  • Initially reject mixed Parquet/Vortex tables, add-existing-files, encryption, and possibly deletes/schema evolution until those paths are implemented
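
The mixed-format restriction could be enforced with a check along these lines. This is only a sketch: AllFilesMatchFormat is a hypothetical helper, and it assumes the per-file format strings have already been read from ducklake_data_file.file_format.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true when every data file recorded for a table uses the table's
// configured format. An empty table trivially passes.
inline bool AllFilesMatchFormat(const std::string &table_format,
                                const std::vector<std::string> &file_formats) {
    return std::all_of(file_formats.begin(), file_formats.end(),
                       [&](const std::string &format) { return format == table_format; });
}
```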

For full DuckLake semantics, the bigger pieces are integrating Vortex with DuckDB’s multi-file reader and making DuckLake’s write, add-files, and metadata paths format-neutral rather than Parquet-specific.
