[SPARK-54923][SQL] Add configuration for directory traversal depth before parallel partition discovery#54787

Open
xiaoxuandev wants to merge 1 commit into apache:master from xiaoxuandev:fix-54923
Conversation

@xiaoxuandev

### What changes were proposed in this pull request?

Adds a new configuration `spark.sql.sources.parallelPartitionDiscovery.traversalDepth` that controls how many directory levels the driver sequentially expands before delegating to parallel workers during partition discovery.

For deep-narrow directory hierarchies (e.g., `root/subpath/{year}/{month}/`), the top levels have very few children, so parallelizing at level 1 creates too few tasks. This config allows the driver to sequentially traverse N-1 levels first, collecting enough paths to parallelize effectively.
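To make the effect of the depth concrete, here is a small self-contained sketch (not the Spark implementation — the tree, object, and `expand` helper are hypothetical illustrations): with depth 1 the single root becomes the only parallel task, while expanding two levels first yields one task per year directory.

```scala
object TraversalDepthSketch {
  // Hypothetical in-memory directory tree for a deep-narrow layout:
  // root -> subpath -> 4 year dirs -> 12 month dirs each.
  val tree: Map[String, Seq[String]] =
    Map[String, Seq[String]](
      "root" -> Seq("root/subpath"),
      "root/subpath" -> (2020 to 2023).map(y => s"root/subpath/$y")
    ) ++ (2020 to 2023).map { y =>
      s"root/subpath/$y" -> (1 to 12).map(m => s"root/subpath/$y/$m")
    }

  // Sequentially expand up to (depth - 1) levels on the "driver",
  // returning the paths that would be handed to parallel workers.
  // Leaves with no children are carried through unchanged.
  def expand(roots: Seq[String], depth: Int): Seq[String] =
    (1 until depth).foldLeft(roots) { (paths, _) =>
      paths.flatMap(p => tree.getOrElse(p, Seq(p)))
    }
}
```

Here `expand(Seq("root"), 1)` returns one path (everything listed by a single worker), while `expand(Seq("root"), 3)` returns the four year directories, so the parallel stage gets four tasks instead of one.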

Key design decisions:

- `traversalDepth=1` (default) preserves identical behavior to the original code
- Expanded intermediate paths use `isRootLevel=false` to preserve SPARK-27676 semantics
- A `tolerateDisappearance` flag handles the race condition where directories confirmed during sequential traversal vanish before parallel listing reaches them, without propagating to recursive subdirectory calls
- `FileStatusCache` keys are re-mapped to original root paths so the cache remains effective when `traversalDepth > 1`
- Sequential traversal filters hidden/underscore directories but preserves `_metadata` and `_common_metadata`
- Mixed directories (containing both files and subdirectories) stop expansion to avoid missing sibling files
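The filtering and mixed-directory rules above can be sketched as two small predicates (hypothetical names, not the actual Spark code): expansion skips hidden and underscore-prefixed children except the Parquet summary names, and halts at any directory that holds both files and subdirectories so sibling files are not dropped from the listing.

```scala
object ExpansionFilterSketch {
  // Names exempt from the underscore filter, per the design decision above.
  private val preserved = Set("_metadata", "_common_metadata")

  // Should sequential traversal descend into / keep this child name?
  def shouldKeep(name: String): Boolean =
    preserved.contains(name) ||
      !(name.startsWith(".") || name.startsWith("_"))

  // A mixed directory (files AND subdirectories) stops further expansion
  // so that its files are still listed alongside its subdirectories.
  def stopsExpansion(childFiles: Int, childDirs: Int): Boolean =
    childFiles > 0 && childDirs > 0
}
```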

### Why are the changes needed?

When tables have deep-narrow directory structures (e.g., a single intermediate path like `root/subpath/` before the partition columns), parallel partition discovery creates only one task for the single root path. This means the entire directory tree is listed sequentially on a single executor, negating the benefit of parallel listing. By pre-expanding a few levels on the driver, we can distribute the work across multiple tasks.

### Does this PR introduce any user-facing change?

Yes. A new SQL configuration `spark.sql.sources.parallelPartitionDiscovery.traversalDepth` is available. Default value is 1 (no behavior change). Setting it to N causes the driver to sequentially expand up to N-1 directory levels before parallelizing.
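For example, a session with a deep-narrow table could opt in like this (a hypothetical usage fragment; the value 3 is illustrative):

```sql
-- Expand up to two directory levels on the driver before parallelizing
SET spark.sql.sources.parallelPartitionDiscovery.traversalDepth=3;
```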

### How was this patch tested?

- 11 new tests in `HadoopFSUtilsDeepNarrowSuite` covering: default behavior, single/multi-level expansion, leaf-before-depth, wide hierarchies, missing directories, mixed files+dirs, hidden directory filtering, `_metadata`/`_common_metadata` preservation, consistency across depths, and multiple input roots
- 2 new tests in `FileIndexSuite` covering: end-to-end `traversalDepth` with `InMemoryFileIndex` and config validation (rejects values < 1)

### Was this patch authored or co-authored using generative AI tooling?

Yes, Opus 4.6

