GOBBLIN-2265: Reliable terminal job-completion status for temporal-on-YARN (AM final status + launcher hooks + exit codes + prompt un-register)#4197
Open
pratapaditya04 wants to merge 2 commits into
Conversation
2029579 to
f5e17bf
Compare
f5e17bf to
190cad9
Compare
820af02 to
6e617be
Compare
…-YARN (AM final status + launcher hooks + exit codes) Make a temporal-on-YARN job's true outcome reliably observable end-to-end without depending on a fragile JVM shutdown hook, and expose clean extension points for a single terminal-GTE source. - AM un-registers with a FinalApplicationStatus derived from the Temporal workflow outcome (COMPLETED->SUCCEEDED, CANCELED->KILLED, else FAILED) instead of an unconditional SUCCEEDED, so a launcher polling the ApplicationReport sees the real result. Kept in lockstep with the AM JVM exit code via mapWorkflowStatusToFinalAppStatus/computeExitCode. - Remove the unreliable JVM-shutdown-hook GTE emitter (RootMetricContext closes the Kafka reporters in its own shutdown hook, so the GTE could be silently dropped). Status capture and exit-code propagation are retained. - Restore the in-workflow JOB_SUCCEEDED/JOB_FAILED GTEs and gate them behind a new gobblin.temporal.job.completion.gte.emission.enabled flag (default true), so a standalone OSS deployment stays self-complete; deployments that designate a launcher subclass as the single GTE source set it false. - GobblinYarnAppLauncher: surface FAILED/KILLED/UNDEFINED, lost-AM-visibility, and never-launched applications as a non-zero launcher exit code, and expose protected no-op hooks (onTerminalApplicationStatus / onLostAmVisibility / onApplicationLaunchFailure) plus getApplicationId/getApplicationName/getConfig accessors so a subclass can emit the single terminal GTE. The OSS launcher itself emits no GTE. - On graceful shutdown, force-kill the YARN application after the wait so the RM does not re-attempt the AM on cancel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
6e617be to
bfb3a20
Compare
…p debug logging Fix a notify race in the temporal AM shutdown path: shutDown() waits on allContainersStopped, but the two callbacks that observe a container going away (AMRM onContainersCompleted -> handleContainerCompletion, and NM onContainerStopped) did not reliably notify the waiter. DynamicScalingYarnService additionally early-returned on the shutdownInProgress branch without notifying. As a result the waiter slept the full wait timeout, delaying unregisterApplicationMaster by minutes and causing YARN to fail the attempt even on a successful workflow. - Add YarnService.notifyIfAllContainersStopped(); call it from handleContainerCompletion and onContainerStopped (base) and from both early-return paths of DynamicScalingYarnService.handleContainerCompletion. - Reduce the container-stop fallback wait from 5m to 2m (the notify makes the normal path immediate; this is only a missed-notify backstop). - Replace the temporary debug markers with normal INFO-level logging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes a temporal-on-YARN job's true outcome reliably observable end-to-end without a fragile JVM shutdown hook, exposes clean extension points so a single launcher subclass can be the sole terminal-
GobblinTrackingEvent(GTE) source, and fixes a shutdown race that delayed the AM's YARN un-register by minutes. (Reworks the earlier shutdown-hook approach after finding it could silently drop the GTE.)Tracks GOBBLIN-2265.
Problem
ApplicationMasterun-registered with a hardcodedFinalApplicationStatus.SUCCEEDED, so a launcher polling theApplicationReportsaw SUCCEEDED even when the workflow FAILED.RootMetricContextregisters its own shutdown hook that closes the Kafka event reporters, and hook ordering is non-deterministic, so the GTE could be dropped.YarnService.shutDown()waits onallContainersStopped, but the last container's removal did not reliably notify that monitor (the AMRMonContainersCompletedpath never notified, andDynamicScalingYarnServiceadditionally early-returned on theshutdownInProgressbranch without notifying). The waiter then slept the full timeout, sounregisterApplicationMasterfired minutes late and YARN failed the attempt even for a successful workflow.Changes
YarnServiceun-registers with aFinalApplicationStatusderived from the captured Temporal workflow status (COMPLETED→SUCCEEDED,CANCELED→KILLED, elseFAILED), kept in lockstep with the AM JVM exit code (mapWorkflowStatusToFinalAppStatus/computeExitCode).true) —gobblin.temporal.job.completion.gte.emission.enabledgates the AM in-workflowJOB_SUCCEEDED/JOB_FAILED, so OSS stays self-complete; deployments that designate a launcher subclass as the single GTE source set itfalse.GobblinYarnAppLauncher— exit code + hooks, no emission. Surfaces FAILED/KILLED/UNDEFINED, lost-AM-visibility, and never-launched applications as a non-zero exit code, and exposes protected no-op hooks (onTerminalApplicationStatus/onLostAmVisibility/onApplicationLaunchFailure) plusgetApplicationId/getApplicationName/getConfigaccessors. The OSS launcher itself emits no GTE — a subclass overrides the hooks to emit the single terminal GTE.YarnService.notifyIfAllContainersStopped()and call it fromhandleContainerCompletionandonContainerStopped, and from both early-return paths ofDynamicScalingYarnService.handleContainerCompletion— so whichever callback removes the last container wakesshutDown()immediately. Reduced the fallback wait from 5m to 2m (now only a missed-notify backstop). In testing this collapsed the time-to-un-register from ~8 min to seconds after the last container stops.Downstream idempotency
KafkaJobStatusMonitoronly fires the observability event / REEVALUATE dag action on the first transition into a finished status (isNewStateTransitionToFinal), so a duplicate terminal GTE (or a Kafka redelivery) for the same flow execution is a no-op — no source-side dedup is required for correctness.Testing
gobblin-temporalandgobblin-yarncompile clean; unit tests pass, incl. newmapWorkflowStatusToFinalAppStatuscases and launcher exit-code/hook-dispatch tests.COMPLETED→ un-registerSUCCEEDED, AM exit0; a permission-denied source produced workflowFAILED→ un-registerFAILED, AM exit1. With the shutdown fix, the un-register fired within seconds of the last container stopping instead of waiting out the timeout.Note
The LinkedIn-internal
RobinGobblinYarnAppLauncher(separate repo) overrides the new hooks to be the single terminal-GTE source; that change lands in a follow-up PR after this publishes.🤖 Generated with Claude Code