Skip to content

GOBBLIN-2265: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes + prompt un-register)#4197

Open
pratapaditya04 wants to merge 2 commits into
apache:masterfrom
pratapaditya04:pratapaditya04/temporal-am-job-completion-gte-v2
Open

GOBBLIN-2265: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes + prompt un-register)#4197
pratapaditya04 wants to merge 2 commits into
apache:masterfrom
pratapaditya04:pratapaditya04/temporal-am-job-completion-gte-v2

Conversation

@pratapaditya04
Copy link
Copy Markdown
Contributor

@pratapaditya04 pratapaditya04 commented May 26, 2026

Summary

Makes a temporal-on-YARN job's true outcome reliably observable end-to-end without a fragile JVM shutdown hook, exposes clean extension points so a single launcher subclass can be the sole terminal-GobblinTrackingEvent (GTE) source, and fixes a shutdown race that delayed the AM's YARN un-register by minutes. (Reworks the earlier shutdown-hook approach after finding it could silently drop the GTE.)

Tracks GOBBLIN-2265.

Problem

  1. The YARN ApplicationMaster un-registered with a hardcoded FinalApplicationStatus.SUCCEEDED, so a launcher polling the ApplicationReport saw SUCCEEDED even when the workflow FAILED.
  2. An earlier attempt emitted the terminal GTE from a JVM shutdown hook — but RootMetricContext registers its own shutdown hook that closes the Kafka event reporters, and hook ordering is non-deterministic, so the GTE could be dropped.
  3. The AM and launcher returned success exit codes in some failure scenarios, masking failures on dashboards.
  4. AM un-register was delayed by minutes on shutdown. YarnService.shutDown() waits on allContainersStopped, but the last container's removal did not reliably notify that monitor (the AMRM onContainersCompleted path never notified, and DynamicScalingYarnService additionally early-returned on the shutdownInProgress branch without notifying). The waiter then slept the full timeout, so unregisterApplicationMaster fired minutes late and YARN failed the attempt even for a successful workflow.

Changes

  • AM reports the real outcome to YARN. The temporal YarnService un-registers with a FinalApplicationStatus derived from the captured Temporal workflow status (COMPLETED→SUCCEEDED, CANCELED→KILLED, else FAILED), kept in lockstep with the AM JVM exit code (mapWorkflowStatusToFinalAppStatus / computeExitCode).
  • Removed the unreliable shutdown-hook GTE emitter; kept status capture + exit-code propagation.
  • Feature flag (default true)gobblin.temporal.job.completion.gte.emission.enabled gates the AM in-workflow JOB_SUCCEEDED/JOB_FAILED, so OSS stays self-complete; deployments that designate a launcher subclass as the single GTE source set it false.
  • GobblinYarnAppLauncher — exit code + hooks, no emission. Surfaces FAILED/KILLED/UNDEFINED, lost-AM-visibility, and never-launched applications as a non-zero exit code, and exposes protected no-op hooks (onTerminalApplicationStatus / onLostAmVisibility / onApplicationLaunchFailure) plus getApplicationId/getApplicationName/getConfig accessors. The OSS launcher itself emits no GTE — a subclass overrides the hooks to emit the single terminal GTE.
  • Prompt AM un-register on shutdown. Added YarnService.notifyIfAllContainersStopped() and call it from handleContainerCompletion and onContainerStopped, and from both early-return paths of DynamicScalingYarnService.handleContainerCompletion — so whichever callback removes the last container wakes shutDown() immediately. Reduced the fallback wait from 5m to 2m (now only a missed-notify backstop). In testing this collapsed the time-to-un-register from ~8 min to seconds after the last container stops.

Downstream idempotency

KafkaJobStatusMonitor only fires the observability event / REEVALUATE dag action on the first transition into a finished status (isNewStateTransitionToFinal), so a duplicate terminal GTE (or a Kafka redelivery) for the same flow execution is a no-op — no source-side dedup is required for correctness.

Testing

  • gobblin-temporal and gobblin-yarn compile clean; unit tests pass, incl. new mapWorkflowStatusToFinalAppStatus cases and launcher exit-code/hook-dispatch tests.
  • End-to-end on a prod-ltx1 GaaS distcp flow: workflow COMPLETED → un-register SUCCEEDED, AM exit 0; a permission-denied source produced workflow FAILED → un-register FAILED, AM exit 1. With the shutdown fix, the un-register fired within seconds of the last container stopping instead of waiting out the timeout.

Note

The LinkedIn-internal RobinGobblinYarnAppLauncher (separate repo) overrides the new hooks to be the single terminal-GTE source; that change lands in a follow-up PR after this publishes.

🤖 Generated with Claude Code

@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/temporal-am-job-completion-gte-v2 branch from 2029579 to f5e17bf Compare June 1, 2026 08:03
@pratapaditya04 pratapaditya04 changed the title GOBBLIN-XXXX: GGW-emitted terminal GTE on AM failure + AM/launcher exit-code propagation GOBBLIN-XXXX: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher GTE hooks + exit codes) Jun 1, 2026
@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/temporal-am-job-completion-gte-v2 branch from f5e17bf to 190cad9 Compare June 1, 2026 08:27
@pratapaditya04 pratapaditya04 changed the title GOBBLIN-XXXX: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher GTE hooks + exit codes) GOBBLIN-XXXX: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes) Jun 1, 2026
@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/temporal-am-job-completion-gte-v2 branch 3 times, most recently from 820af02 to 6e617be Compare June 1, 2026 09:20
…-YARN (AM final status + launcher hooks + exit codes)

Make a temporal-on-YARN job's true outcome reliably observable end-to-end without depending on a
fragile JVM shutdown hook, and expose clean extension points for a single terminal-GTE source.

- AM un-registers with a FinalApplicationStatus derived from the Temporal workflow outcome
  (COMPLETED->SUCCEEDED, CANCELED->KILLED, else FAILED) instead of an unconditional SUCCEEDED, so a
  launcher polling the ApplicationReport sees the real result. Kept in lockstep with the AM JVM exit
  code via mapWorkflowStatusToFinalAppStatus/computeExitCode.
- Remove the unreliable JVM-shutdown-hook GTE emitter (RootMetricContext closes the Kafka reporters
  in its own shutdown hook, so the GTE could be silently dropped). Status capture and exit-code
  propagation are retained.
- Restore the in-workflow JOB_SUCCEEDED/JOB_FAILED GTEs and gate them behind a new
  gobblin.temporal.job.completion.gte.emission.enabled flag (default true), so a standalone OSS
  deployment stays self-complete; deployments that designate a launcher subclass as the single GTE
  source set it false.
- GobblinYarnAppLauncher: surface FAILED/KILLED/UNDEFINED, lost-AM-visibility, and never-launched
  applications as a non-zero launcher exit code, and expose protected no-op hooks
  (onTerminalApplicationStatus / onLostAmVisibility / onApplicationLaunchFailure) plus
  getApplicationId/getApplicationName/getConfig accessors so a subclass can emit the single terminal
  GTE. The OSS launcher itself emits no GTE.
- On graceful shutdown, force-kill the YARN application after the wait so the RM does not re-attempt
  the AM on cancel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pratapaditya04 pratapaditya04 force-pushed the pratapaditya04/temporal-am-job-completion-gte-v2 branch from 6e617be to bfb3a20 Compare June 1, 2026 09:35
…p debug logging

Fix a notify race in the temporal AM shutdown path: shutDown() waits on
allContainersStopped, but the two callbacks that observe a container going
away (AMRM onContainersCompleted -> handleContainerCompletion, and NM
onContainerStopped) did not reliably notify the waiter. DynamicScalingYarnService
additionally early-returned on the shutdownInProgress branch without notifying.
As a result the waiter slept the full wait timeout, delaying
unregisterApplicationMaster by minutes and causing YARN to fail the attempt
even on a successful workflow.

- Add YarnService.notifyIfAllContainersStopped(); call it from
  handleContainerCompletion and onContainerStopped (base) and from both
  early-return paths of DynamicScalingYarnService.handleContainerCompletion.
- Reduce the container-stop fallback wait from 5m to 2m (the notify makes the
  normal path immediate; this is only a missed-notify backstop).
- Replace the temporary debug markers with normal INFO-level logging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pratapaditya04 pratapaditya04 changed the title GOBBLIN-XXXX: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes) GOBBLIN-2265: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes + prompt un-register) Jun 1, 2026
@pratapaditya04 pratapaditya04 marked this pull request as ready for review June 1, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant