[GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling#12042

Open
Yao-MR wants to merge 2 commits into apache:main from Yao-MR:feature/optimize-parquet-metadata-validation-sampling

@Yao-MR Yao-MR commented May 6, 2026

What changes are proposed in this pull request?

When a table has many partitions, metadata validation checks up to fileLimit files under every root path, which incurs excessive I/O cost.

This patch introduces a sampling mechanism that selects a percentage of root paths for validation instead of checking all of them. The file limit is distributed evenly across the sampled paths.

Key changes:

  • Add config spark.gluten.sql.fallbackUnexpectedMetadataParquet.samplePercentage with default value 0.1 (10% sampling)
  • Use evenly spaced interval sampling for good partition coverage
  • Add unit tests for the sampling logic
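
For reviewers, the idea can be sketched roughly as follows. This is a minimal Python illustration, not the code in this patch: the function names `sample_root_paths` and `files_per_path`, the round-up rule, and the "at least one path, at least one file" floors are all assumptions made for the example. The actual rounding and edge-case behavior should be checked against the diff.

```python
import math


def sample_root_paths(root_paths, sample_percentage):
    """Pick evenly spaced root paths (hypothetical helper, not Gluten's API).

    Assumption: the sample size is rounded up and is never below 1
    when there is at least one root path.
    """
    n = len(root_paths)
    if n == 0:
        return []
    if not 0.0 < sample_percentage <= 1.0:
        raise ValueError("sample_percentage must be in (0, 1]")
    k = max(1, math.ceil(n * sample_percentage))
    # Indices i * n // k are spread across the whole list, so the sample
    # covers early, middle, and late partitions rather than only a prefix.
    return [root_paths[i * n // k] for i in range(k)]


def files_per_path(file_limit, num_sampled):
    """Split the file limit evenly across sampled paths.

    Assumption: every sampled path gets at least one file to check.
    """
    return max(1, file_limit // num_sampled)


if __name__ == "__main__":
    paths = [f"/warehouse/t/dt={i:02d}" for i in range(20)]
    sampled = sample_root_paths(paths, 0.25)
    print(sampled)
    print(files_per_path(10, len(sampled)))
```

With 20 root paths and a sample percentage of 0.25, this sketch selects 5 paths at a stride of 4 and checks 2 files under each when the file limit is 10, instead of up to 10 files under each of the 20 paths.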

How was this patch tested?

Existing tests in ParquetEncryptionDetectionSuite continue to pass without modification.

Was this patch authored or co-authored using generative AI tooling?

ISSUE: #11782

@github-actions github-actions Bot added CORE works for Gluten Core VELOX DOCS labels May 6, 2026
github-actions Bot commented May 6, 2026

Run Gluten Clickhouse CI on x86

@Yao-MR Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 706c87f to 970fb06 Compare May 6, 2026 06:09
@Yao-MR Yao-MR marked this pull request as ready for review May 6, 2026 09:40
@Yao-MR Yao-MR changed the title [GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling… [GLUTEN-11782][CORE] Optimize parquet metadata validation by sampling May 6, 2026
@Yao-MR Yao-MR closed this May 6, 2026
@Yao-MR Yao-MR reopened this May 6, 2026
@Yao-MR Yao-MR closed this May 6, 2026
@Yao-MR Yao-MR reopened this May 6, 2026
@Yao-MR Yao-MR force-pushed the feature/optimize-parquet-metadata-validation-sampling branch from 51b280f to e99a01e Compare May 7, 2026 01:43
Yao-MR commented May 7, 2026

Hi @jinchengchenghh, this PR adds sampling-based validation, which can improve performance.
Could you review it when you have a chance?
Thanks.

@Yao-MR Yao-MR closed this May 7, 2026
@Yao-MR Yao-MR reopened this May 7, 2026
@Yao-MR Yao-MR closed this May 7, 2026
@Yao-MR Yao-MR reopened this May 7, 2026