GH-48636: [C++][Parquet] Improve parquet reading using multi threads by OmBiradar · Pull Request #50158 · apache/arrow

OmBiradar · 2026-06-11T13:21:49Z

Rationale for this change

Currently the parquet file structs are read sequentially, even when the use_threads option is used by the user. This PR aims to bridge this gap by making the struct reading truly parallel.

What changes are included in this PR?

Changes to the LoadBatch and BuildArray functions in the StructReader in the Parquet reader to enable multi threaded reads of structs in parquet files.

Are these changes tested?

Yes I have tested them. Details can be found in this testing and benchmarking repo
Weather benchmarks should be added for parallel reads is a question I would ask the reviewer.

Are there any user-facing changes?

This is purely a performance related PR. No changes to the user or API is needed.

GitHub Issue: [Python] Enable multi-threaded reads of struct-list stored data in parquet files #48636

github-actions · 2026-06-11T13:22:15Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates Parquet testing submodule pins and introduces optional parallelism when loading/building nested struct children readers to improve throughput when use_threads() is enabled.

Changes:

Bump testing and cpp/submodules/parquet-testing submodule commits.
Use ::arrow::internal::OptionalParallelFor to parallelize StructReader::LoadBatch.
Parallelize StructReader::BuildArray child array construction before converting chunks to single arrays.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
testing	Updates submodule commit SHA.
cpp/submodules/parquet-testing	Updates submodule commit SHA for parquet-testing.
cpp/src/parquet/arrow/reader.cc	Adds optional parallel execution for struct child `LoadBatch` and `BuildArray`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

OmBiradar · 2026-06-12T01:39:21Z

+    return ::arrow::internal::OptionalParallelFor(
+        ctx_->reader_properties->use_threads(), static_cast<int>(children_.size()),
+        [&](int i) { return children_[i]->LoadBatch(records_to_read); });


template <class FUNCTION> Status OptionalParallelFor(bool use_threads, int num_tasks, FUNCTION&& func, Executor* executor = internal::GetCpuThreadPool()) { if (use_threads) { return ParallelFor(num_tasks, std::forward<FUNCTION>(func), executor); } else { for (int i = 0; i < num_tasks; ++i) { RETURN_NOT_OK(func(i)); } return Status::OK(); } }

At the end, it's always cast to an int so truncating is unavoidable.

OmBiradar · 2026-06-12T01:40:36Z

-    std::shared_ptr<ChunkedArray> field;
-    RETURN_NOT_OK(child->BuildArray(validity_io.values_read, &field));
-    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ArrayData> array_data, ChunksToSingle(*field));
+  const int num_children = static_cast<int>(children_.size());


template <class FUNCTION> Status OptionalParallelFor(bool use_threads, int num_tasks, FUNCTION&& func, Executor* executor = internal::GetCpuThreadPool()) { if (use_threads) { return ParallelFor(num_tasks, std::forward<FUNCTION>(func), executor); } else { for (int i = 0; i < num_tasks; ++i) { RETURN_NOT_OK(func(i)); } return Status::OK(); } }

At the end, it's always cast to an int so truncating is unavoidable.

Signed-off-by: OmBiradar <ombiradar04@gmail.com>

OmBiradar · 2026-06-12T01:42:57Z

@wgtmac @pitrou

Should I add benchmarks for the multi threaded reading scenario, I am asking as I have not seen any other multi threaded performance benchmarks.

wgtmac · 2026-06-12T02:30:36Z

TBH, I don't think it is a good approach as we've tried this in the past. The main gotcha is that reading costs of different columns vary significantly by nature. For example, strings take longer time to decompress and decode but integers are smaller and faster. If the file is on a cloud object store, the majority time is blocked on waiting for I/O which may exhaust the thread pool if it is a wide column file.

OmBiradar · 2026-06-12T03:57:20Z

Yes, even I faced the same IO limitation while testing the changes.

But as this is only triggered when the user wants multi threaded reading. When the file is close to the processor i.e in memory there is a significant improvement (considering a normal case). I personally tested it by loading the whole file into memory and then benchmarking the reading.

An edge case like a heavy column (I mean w.r.t the computing time) the delay is inevitable unless there is a complex system to manage the building and loading of a single column using multiple threads.

I would say that this change would not be the best case but give workable results?

It would always beat a single thread I think? Only exception I can think is of having many many light weight columns in the struct, then maybe the overhead of syncing the threads might slow down the multi threaded approach over the single threaded one.

Could you suggest any other approach that I am not able to think about?

Copilot AI review requested due to automatic review settings June 11, 2026 13:21

OmBiradar requested review from pitrou and wgtmac as code owners June 11, 2026 13:21

github-actions Bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Jun 11, 2026

Copilot AI reviewed Jun 11, 2026

View reviewed changes

OmBiradar changed the title ~~Improve parquet reading using multi threads~~ GH-48636: [C++][Parquet] Improve parquet reading using multi threads Jun 11, 2026

Improve parquet reading using multi threads

3512d99

Signed-off-by: OmBiradar <ombiradar04@gmail.com>

OmBiradar force-pushed the parallel-read-for-Structs branch from fdade64 to 3512d99 Compare June 12, 2026 01:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-48636: [C++][Parquet] Improve parquet reading using multi threads#50158

GH-48636: [C++][Parquet] Improve parquet reading using multi threads#50158
OmBiradar wants to merge 1 commit into
apache:mainfrom
OmBiradar:parallel-read-for-Structs

OmBiradar commented Jun 11, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

OmBiradar Jun 12, 2026 •

edited

Loading

Uh oh!

OmBiradar Jun 12, 2026

Uh oh!

OmBiradar commented Jun 12, 2026

Uh oh!

wgtmac commented Jun 12, 2026

Uh oh!

OmBiradar commented Jun 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

OmBiradar commented Jun 11, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

OmBiradar Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

OmBiradar Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

OmBiradar commented Jun 12, 2026

Uh oh!

wgtmac commented Jun 12, 2026

Uh oh!

OmBiradar commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OmBiradar commented Jun 11, 2026 •

edited by github-actions Bot

Loading

OmBiradar Jun 12, 2026 •

edited

Loading

OmBiradar commented Jun 12, 2026 •

edited

Loading