Describe the enhancement requested
Hi,
One of the design principles of Parquet, from their GitHub page, is 'Separating metadata and column data':

> Separating metadata and column data.
> The format is explicitly designed to separate the metadata from the data. This allows splitting columns into multiple files, as well as having a single metadata file reference multiple parquet files.
In order to achieve 'columns in different files', we need to:
- Ensure each file has the same number of row groups
- Ensure each corresponding pair of row groups has the same number of rows
- Grab the metadata from each file, **'zip'/attach them vertically**, and write out the new metadata file
- Feed this metadata in while reading the table
It looks like the Arrow APIs provide nearly everything needed to achieve this, except for the metadata-combining step (point 3 above).
This ticket requests a new API to 'zip'/'join'/'attach' the metadata from two files.
For example:

```python
import pyarrow.parquet as pq

m1 = pq.read_metadata('file1.parquet')  # say this has columns: col1, col2, col3
m1.set_file_path('file1.parquet')
m2 = pq.read_metadata('file2.parquet')  # say this has columns: col4, col5
m2.set_file_path('file2.parquet')

# requesting this new 'zip' API
m = m1.zip(m2)  # needs to ensure the same number of row groups, and the same number of rows within each row group
# m will now have metadata for col1, col2, col3, col4, col5, each pointing to the appropriate data file
m.write_metadata('_metadata')
```
Once this is done, the combined data could be read using:

```python
m = pq.read_metadata('_metadata')
data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)
# data should now be able to show all columns
```
Component(s)
C++, Python, Parquet