Support file-level Parquet RowSelection #22939

@haohuaijin

Is your feature request related to a problem or challenge?

DataFusion supports ParquetAccessPlan as a PartitionedFile extension, but that API is row-group based. Callers must know the row group layout and split any file-level row selection themselves.

DataFusion already loads the row group metadata when opening the file, so it can split that selection internally.

Describe the solution you'd like

Add a ParquetRowSelection extension that wraps a file-level RowSelection.

When opening the file, DataFusion should split the selection at row group boundaries and convert it to the existing ParquetAccessPlan form, with one entry per row group:

  • all rows skipped: RowGroupAccess::Skip
  • all rows selected: RowGroupAccess::Scan
  • mixed selected/skipped rows: RowGroupAccess::Selection

The selection should be rejected if the total number of rows it covers (selected plus skipped) does not match the file's row count in the Parquet metadata.
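
A minimal sketch of the conversion described above, to make the proposal concrete. It does not use the real parquet or DataFusion types: `Run` and `Access` are hypothetical stand-ins that mirror the shape of `RowSelector` and `RowGroupAccess`, and `split_selection` is an illustrative name, not an existing API. It assumes the per-row-group row counts have already been read from the file metadata.

```rust
/// Hypothetical stand-in for a row selector: a run of consecutive rows
/// that is either skipped or selected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical stand-in for the per-row-group access decision.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Access {
    Skip,
    Scan,
    Selection(Vec<Run>),
}

/// Split a file-level selection into one `Access` per row group.
/// `row_group_rows[i]` is the number of rows in row group `i`.
pub fn split_selection(
    selection: &[Run],
    row_group_rows: &[usize],
) -> Result<Vec<Access>, String> {
    // Reject selections whose total row count disagrees with the metadata.
    let covered: usize = selection.iter().map(|r| r.row_count).sum();
    let file_rows: usize = row_group_rows.iter().sum();
    if covered != file_rows {
        return Err(format!(
            "selection covers {} rows but the file has {}",
            covered, file_rows
        ));
    }

    // Walk the runs once, carving off as many rows as each row group needs.
    let mut runs = selection.iter().copied().filter(|r| r.row_count > 0);
    let mut current = runs.next();
    let mut plan = Vec::with_capacity(row_group_rows.len());

    for &group_rows in row_group_rows {
        let mut remaining = group_rows;
        let mut local: Vec<Run> = Vec::new();

        while remaining > 0 {
            // Totals were validated, so a run is always available here.
            let run = current.as_mut().expect("totals were validated");
            let take = run.row_count.min(remaining);
            let skip = run.skip;
            run.row_count -= take;
            let exhausted = run.row_count == 0;
            remaining -= take;

            // Merge adjacent runs of the same kind within this row group.
            match local.last_mut() {
                Some(last) if last.skip == skip => last.row_count += take,
                _ => local.push(Run { row_count: take, skip }),
            }
            if exhausted {
                current = runs.next();
            }
        }

        // An empty row group has no selected rows and is treated as Skip.
        let access = if local.iter().all(|r| r.skip) {
            Access::Skip
        } else if local.iter().all(|r| !r.skip) {
            Access::Scan
        } else {
            Access::Selection(local)
        };
        plan.push(access);
    }
    Ok(plan)
}
```

For example, with three row groups of 100 rows each and a selection of "skip 100, select 150, skip 50", this sketch yields `Skip` for the first row group, `Scan` for the second, and a `Selection` of "select 50, skip 50" for the third. A selection covering 299 rows against the same file returns an error.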

Describe alternatives you've considered

No response

Additional context

No response

Labels

enhancement (New feature or request)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions