Skip to content

GH-48636: [C++][Parquet] Improve parquet reading using multi threads#50158

Open
OmBiradar wants to merge 1 commit into
apache:mainfrom
OmBiradar:parallel-read-for-Structs
Open

GH-48636: [C++][Parquet] Improve parquet reading using multi threads#50158
OmBiradar wants to merge 1 commit into
apache:mainfrom
OmBiradar:parallel-read-for-Structs

Conversation

@OmBiradar

@OmBiradar OmBiradar commented Jun 11, 2026

Copy link
Copy Markdown

Rationale for this change

Currently the parquet file structs are read sequentially, even when the use_threads option is used by the user. This PR aims to bridge this gap by making the struct reading truly parallel.

What changes are included in this PR?

Changes to the LoadBatch and BuildArray functions in the StructReader in the Parquet reader to enable multi threaded reads of structs in parquet files.

Are these changes tested?

  • Yes I have tested them. Details can be found in this testing and benchmarking repo
  • Weather benchmarks should be added for parallel reads is a question I would ask the reviewer.

Are there any user-facing changes?

This is purely a performance related PR. No changes to the user or API is needed.

Copilot AI review requested due to automatic review settings June 11, 2026 13:21
@OmBiradar OmBiradar requested review from pitrou and wgtmac as code owners June 11, 2026 13:21
@github-actions

Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates Parquet testing submodule pins and introduces optional parallelism when loading/building nested struct children readers to improve throughput when use_threads() is enabled.

Changes:

  • Bump testing and cpp/submodules/parquet-testing submodule commits.
  • Use ::arrow::internal::OptionalParallelFor to parallelize StructReader::LoadBatch.
  • Parallelize StructReader::BuildArray child array construction before converting chunks to single arrays.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
testing Updates submodule commit SHA.
cpp/submodules/parquet-testing Updates submodule commit SHA for parquet-testing.
cpp/src/parquet/arrow/reader.cc Adds optional parallel execution for struct child LoadBatch and BuildArray.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +738 to +740
return ::arrow::internal::OptionalParallelFor(
ctx_->reader_properties->use_threads(), static_cast<int>(children_.size()),
[&](int i) { return children_[i]->LoadBatch(records_to_read); });

@OmBiradar OmBiradar Jun 12, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template <class FUNCTION>
Status OptionalParallelFor(bool use_threads, int num_tasks, FUNCTION&& func,
                           Executor* executor = internal::GetCpuThreadPool()) {
  if (use_threads) {
    return ParallelFor(num_tasks, std::forward<FUNCTION>(func), executor);
  } else {
    for (int i = 0; i < num_tasks; ++i) {
      RETURN_NOT_OK(func(i));
    }
    return Status::OK();
  }
}

At the end, it's always cast to an int so truncating is unavoidable.

std::shared_ptr<ChunkedArray> field;
RETURN_NOT_OK(child->BuildArray(validity_io.values_read, &field));
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<ArrayData> array_data, ChunksToSingle(*field));
const int num_children = static_cast<int>(children_.size());

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

template <class FUNCTION>
Status OptionalParallelFor(bool use_threads, int num_tasks, FUNCTION&& func,
                           Executor* executor = internal::GetCpuThreadPool()) {
  if (use_threads) {
    return ParallelFor(num_tasks, std::forward<FUNCTION>(func), executor);
  } else {
    for (int i = 0; i < num_tasks; ++i) {
      RETURN_NOT_OK(func(i));
    }
    return Status::OK();
  }
}

At the end, it's always cast to an int so truncating is unavoidable.

@OmBiradar OmBiradar changed the title Improve parquet reading using multi threads GH-48636: [C++][Parquet] Improve parquet reading using multi threads Jun 11, 2026
Signed-off-by: OmBiradar <ombiradar04@gmail.com>
@OmBiradar OmBiradar force-pushed the parallel-read-for-Structs branch from fdade64 to 3512d99 Compare June 12, 2026 01:20
@OmBiradar

Copy link
Copy Markdown
Author

@wgtmac @pitrou

Should I add benchmarks for the multi threaded reading scenario, I am asking as I have not seen any other multi threaded performance benchmarks.

@wgtmac

wgtmac commented Jun 12, 2026

Copy link
Copy Markdown
Member

TBH, I don't think it is a good approach as we've tried this in the past. The main gotcha is that reading costs of different columns vary significantly by nature. For example, strings take longer time to decompress and decode but integers are smaller and faster. If the file is on a cloud object store, the majority time is blocked on waiting for I/O which may exhaust the thread pool if it is a wide column file.

@OmBiradar

OmBiradar commented Jun 12, 2026

Copy link
Copy Markdown
Author

Yes, even I faced the same IO limitation while testing the changes.

But as this is only triggered when the user wants multi threaded reading. When the file is close to the processor i.e in memory there is a significant improvement (considering a normal case). I personally tested it by loading the whole file into memory and then benchmarking the reading.

An edge case like a heavy column (I mean w.r.t the computing time) the delay is inevitable unless there is a complex system to manage the building and loading of a single column using multiple threads.

I would say that this change would not be the best case but give workable results?

It would always beat a single thread I think? Only exception I can think is of having many many light weight columns in the struct, then maybe the overhead of syncing the threads might slow down the multi threaded approach over the single threaded one.

Could you suggest any other approach that I am not able to think about?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants