
[SPARK-55974][CORE][YARN] Relaunch new executors if the executor launching take too long time #54771

Open
AngersZhuuuu wants to merge 3 commits into apache:master from AngersZhuuuu:SPARK-55974

AngersZhuuuu (Contributor) commented Mar 12, 2026

What changes were proposed in this pull request?

This PR adds executor launch timeout tracking and automatic relaunch for YARN mode. When an executor takes too long to register with the driver after its container is allocated, the Application Master (AM) will treat it as stuck and request replacement containers.
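The tracking described above can be illustrated with a minimal sketch. This is not the PR's actual implementation: the class `LaunchTracker`, its method names, and the container IDs are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchTracker:
    """Hypothetical launch-timeout tracker; not the code in this PR."""
    timeout_s: float
    # container id -> time (in seconds) at which the container was allocated
    pending: dict = field(default_factory=dict)

    def on_container_allocated(self, container_id: str, now: float) -> None:
        # Start the launch clock when the container is allocated.
        self.pending[container_id] = now

    def on_executor_registered(self, container_id: str) -> None:
        # A registered executor is no longer a launch-timeout candidate.
        self.pending.pop(container_id, None)

    def expire(self, now: float) -> list:
        """Return and forget containers whose launch exceeded the timeout."""
        stuck = [cid for cid, t in self.pending.items()
                 if now - t > self.timeout_s]
        for cid in stuck:
            del self.pending[cid]
        return stuck


tracker = LaunchTracker(timeout_s=600.0)  # 10 minutes, matching the stated default
tracker.on_container_allocated("container_1", now=0.0)
tracker.on_container_allocated("container_2", now=0.0)
tracker.on_executor_registered("container_1")

stuck = tracker.expire(now=601.0)
print(stuck)  # ['container_2']
```

In this sketch, each ID returned by `expire` would correspond to one replacement container request from the AM.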

Why are the changes needed?

In YARN mode, executors can get stuck during launch (e.g., slow node, resource contention, network issues). Without a timeout, the AM keeps waiting indefinitely, which can:

  • Block progress when executors never register.
  • Prevent new executors from being requested.
  • Cause jobs to hang or run with fewer executors than expected.

This change adds a configurable timeout so the AM can detect stuck launches and request replacement executors, improving reliability and resource utilization.

Does this PR introduce any user-facing change?

Yes. A new configuration is introduced: spark.yarn.containerLaunchTimeout (default: 10min).
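As a sketch of how the setting might be overridden per application, the snippet below assembles a `spark-submit` command line. It assumes the option accepts a Spark time string, as the `10min` default suggests. The value `5min` and the file name `app.py` are illustrative only.

```python
# Assumption: the option takes a Spark time string, like its "10min" default.
launch_timeout = "5min"

cmd = [
    "spark-submit",
    "--master", "yarn",
    "--conf", f"spark.yarn.containerLaunchTimeout={launch_timeout}",
    "app.py",  # hypothetical application file
]
print(" ".join(cmd))
```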

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No
