[CELEBORN-2306] Master adds shutdown hook for ratis stepdown #3659
kaybhutani wants to merge 10 commits into apache:main
Conversation
Pull request overview
This PR adds a Master JVM shutdown hook and introduces an HA “graceful shutdown” path intended to transfer Raft leadership before shutdown, reducing client impact during leader termination.
Changes:
- Add a Master shutdown hook to invoke Master.stop(...) on JVM termination.
- Add an HA graceful shutdown config and, when enabled, stop the Ratis server during Master shutdown (including leader transfer).
- Update HARaftServer.stepDown() to return a boolean and add a system test around leader shutdown behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| master/src/test/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/RatisMasterStatusSystemSuiteJ.java | Adds a new HA test that stops the leader Raft server and asserts closure. |
| master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala | Registers a JVM shutdown hook and conditionally stops the Raft server in onStop(). |
| master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/HARaftServer.java | Enhances stop() to attempt leadership transfer; changes stepDown() to return success/failure. |
| docs/configuration/ha.md | Documents the new HA graceful shutdown config flag. |
| common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | Adds the new config entry and accessor for HA graceful shutdown enablement. |
⏺ Review: [CELEBORN-2306] Master shutdown hook and Raft stepdown

PR: #3659 | Author: kaybhutani | Change: +93 / -2 across 5 files

What it does

Adds a JVM shutdown hook to the Celeborn Master that, when celeborn.master.ha.gracefulShutdown.enabled is set, transfers Raft leadership to another node before shutting down. This avoids the Raft election window during which no leader is available.
Issues
The shutdown hook calls stop(EXIT_IMMEDIATELY), which calls rpcEnv.stop(self) and schedules onStop() asynchronously on the RPC dispatcher. Since dispatcher threads are daemon threads, the JVM can exit before onStop() runs, meaning the Raft step-down may never execute. Fix: perform the Raft step-down directly in Master.stop() or in the shutdown hook itself, not in the async onStop() callback. Alternatively, use a CountDownLatch to block the hook until onStop() completes.
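The CountDownLatch variant could look roughly like the JDK-only sketch below. The class and method names (GracefulShutdown, onStop, awaitStopped) are illustrative stand-ins, not Celeborn's actual code; the point is only that the hook blocks JVM exit, with a bounded wait, until the async callback has finished.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class GracefulShutdown {
  static final CountDownLatch STOPPED = new CountDownLatch(1);

  // Stand-in for Master's async onStop() callback, which in the PR runs on a
  // daemon RPC-dispatcher thread after rpcEnv.stop(self).
  static void onStop() {
    // ... transfer Raft leadership, stop the Ratis server ...
    STOPPED.countDown(); // signal that cleanup finished
  }

  // Bounded wait so a hung transfer cannot stall JVM exit forever.
  static boolean awaitStopped() {
    try {
      return STOPPED.await(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    // The hook blocks JVM teardown until onStop() has run (or timed out),
    // closing the daemon-thread race described above.
    Runtime.getRuntime().addShutdownHook(new Thread(GracefulShutdown::awaitStopped));

    Thread dispatcher = new Thread(GracefulShutdown::onStop);
    dispatcher.setDaemon(true); // mirrors the RPC dispatcher's daemon status
    dispatcher.start();
  }
}
```

The bounded await matters: an unbounded await in a shutdown hook would trade the "exits too early" bug for a "never exits" one.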
When graceful shutdown is enabled and the node is leader: Master.onStop() calls ratisServer.stop(), and then the normal statusSystem/HAMasterMetaManager shutdown path likely calls it again. This can throw an IOException or produce confusing logs.
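One conventional way to make the double call harmless is an idempotency guard. This is a generic sketch (the class name IdempotentRaftStop is hypothetical), not the PR's code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class IdempotentRaftStop {
  private final AtomicBoolean stopped = new AtomicBoolean(false);

  /**
   * Returns true only for the first caller; later callers see false and skip
   * the close, so onStop() and the statusSystem shutdown path can both call
   * stop() without a double-close IOException or duplicate log lines.
   */
  public boolean stop() {
    if (!stopped.compareAndSet(false, true)) {
      return false; // already stopped by the other shutdown path
    }
    // ... actually stop the Ratis server here ...
    return true;
  }
}
```

compareAndSet makes the guard safe even if the two shutdown paths race on different threads.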
The shutdown hook is registered unconditionally in Master.initialize(), but the config description says "the master will run a shutdown hook." The config actually only gates the Raft transfer inside onStop(). Either gate the hook registration behind the config, or reword the description to match the actual behavior.
testGracefulLeaderShutdownStepDown calls leader.stop() on a shared static RATISSERVER instance. This permanently closes the leader, breaking any subsequent tests that assume a 3-node cluster. The test should either spin up an isolated cluster or restore the shared state afterwards.
HARaftServer.stop() unconditionally attempts transfer when the node is leader. The config gate in Master.onStop() only controls whether stop() is called early. This is arguably fine (best-effort), but worth documenting.
The TransferLeadershipRequest has a 60s timeout, but stop() itself has no guard. If the Ratis call hangs, shutdown stalls. Consider wrapping in a Future with a bounded timeout.
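A bounded-timeout wrapper could be sketched as follows. BoundedTransfer and its transferLeadership() method are hypothetical stand-ins for the blocking server.transferLeadership(request) call; the sketch only shows the Future-with-timeout pattern:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTransfer {
  // Hypothetical stand-in for the blocking server.transferLeadership(request) call.
  static boolean transferLeadership() {
    return true;
  }

  /** Best-effort transfer: give up after timeoutMs instead of stalling shutdown. */
  static boolean transferWithTimeout(long timeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "leadership-transfer");
      t.setDaemon(true); // a hung call cannot keep the JVM alive
      return t;
    });
    try {
      Future<Boolean> pending = executor.submit(BoundedTransfer::transferLeadership);
      return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException | ExecutionException e) {
      return false; // proceed with shutdown anyway; transfer is best-effort
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

Running the call on a daemon worker plus Future.get(timeout) gives two independent guards: the shutdown path stops waiting after the timeout, and even a permanently hung call cannot block JVM exit.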
A successful transfer now logs from both stepDown() ("Successfully step down leader") and stop() ("Successfully transferred leadership"). Remove one.

Verdict

Request changes. The motivation is solid: graceful leadership transfer is valuable for production HA. But the critical issue is that the Raft transfer logic in onStop() may never execute due to daemon-thread timing (Issue 1).
@kaybhutani, thanks for the contribution. Please address the above comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```java
      }
    } catch (Exception e) {
      LOG.warn("Step down leader failed!", e);
      return false;
```
We could return false only at the end of this method.
@kaybhutani, thanks for the updates. Please run
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Made changes, @SteNicholas.
@kaybhutani, thanks for the update. Please fix the CI failure.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
```diff
@@ -628,13 +657,13 @@ void stepDown() {
           REQUEST_TIMEOUT_MS);
       RaftClientReply reply = server.transferLeadership(request);
       if (reply.isSuccess()) {
-        LOG.info("Successfully step down leader {}.", server.getId());
-      } else {
-        LOG.warn("Step down leader failed!");
+        return true;
       }
+      LOG.warn("Step down leader {} failed.", server.getId());
     } catch (Exception e) {
-      LOG.warn("Step down leader failed!", e);
+      LOG.warn("Step down leader {} failed.", server.getId(), e);
     }
+    return false;
```
We already log "Successfully transferred leadership from {} in {}ms." on successful transfer, so this one seems redundant to me, but I am open to adding it if reviewers disagree.
What changes were proposed in this pull request?
Why are the changes needed?
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
Yes, adds an additional config.
How was this patch tested?
Added tests and validated the behavior.