Describe the enhancement requested
Hi,
One of the design principles of Parquet, from their GitHub page, is 'Separating metadata and column data':

> Separating metadata and column data.
> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
In order to achieve 'columns in different files', we need to:
- Ensure each file has the same number of row groups
- Ensure each corresponding pair of row groups has the same number of rows
- Grab the metadata from each file, **'zip'/attach them vertically**, and write out the new metadata file
- Feed this metadata in while reading the table
It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the metadata-combining step (point 3 above).
This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.
For example:

```python
import pyarrow.parquet as pq

m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')
m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
m = m1.zip(m2)  # needs to ensure the same number of row groups, and the same number of rows within each row group
# m will now have metadata for col1, col2, col3, col4, col5, each pointing to the appropriate data file
m.write_metadata('_metadata')
```
Once this is done, the combined data could be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)
# data should now be able to show all columns
```
Component(s)
C++, Python, Parquet