Skip to content

[CELEBORN-2306] Master adds shutdown hook for ratis stepdown#3659

Open
kaybhutani wants to merge 10 commits intoapache:mainfrom
kaybhutani:kartikay/graceful-master-shutdown
Open

[CELEBORN-2306] Master adds shutdown hook for ratis stepdown#3659
kaybhutani wants to merge 10 commits intoapache:mainfrom
kaybhutani:kartikay/graceful-master-shutdown

Conversation

@kaybhutani
Copy link
Copy Markdown
Contributor

@kaybhutani kaybhutani commented Apr 12, 2026

What changes were proposed in this pull request?

  • Adds master shutdown hook
  • Updates RAFT stepdown to return bool
  • Calls RAFT stepdown on manager shutdown to do graceful stepdown

Why are the changes needed?

  • Manager.stop() isnt being called from anywhere, it emmits some logs as well for "Stopping manager" but since there is no shutdown hook defined, none of them are logged or the function is called at all
  • We faced a certain issue where the leader got removed from service mesh before shutdown and followers redirected to it because it was still running (able to send requests but not receive) for a brief period. This method adds a graceful shutdown option to do a RATIS stepdown before shutting down.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

Yes, adds an additional config.

How was this patch tested?

Added a tests and validated that.

@kaybhutani kaybhutani changed the title Master should call shutdown hooks and transfer raft leadership gracefully [CELEBORN-2306] Master should call shutdown hooks and transfer raft leadership gracefully Apr 12, 2026
@kaybhutani kaybhutani changed the title [CELEBORN-2306] Master should call shutdown hooks and transfer raft leadership gracefully [CELEBORN-2306] Register Master shutdown and transfer Raft leadership gracefully Apr 12, 2026
@kaybhutani kaybhutani changed the title [CELEBORN-2306] Register Master shutdown and transfer Raft leadership gracefully [CELEBORN-2306] Register Master shutdown hooks and transfer Raft leadership gracefully Apr 12, 2026
@kaybhutani kaybhutani changed the title [CELEBORN-2306] Register Master shutdown hooks and transfer Raft leadership gracefully [CELEBORN-2306] Master shutdown hook and Raft stepdown Apr 12, 2026
@kaybhutani kaybhutani marked this pull request as draft April 12, 2026 21:40
@kaybhutani kaybhutani closed this Apr 12, 2026
@kaybhutani kaybhutani reopened this Apr 17, 2026
@kaybhutani kaybhutani marked this pull request as ready for review April 17, 2026 07:04
@SteNicholas SteNicholas requested a review from Copilot April 20, 2026 03:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a Master JVM shutdown hook and introduces an HA “graceful shutdown” path intended to transfer Raft leadership before shutdown, reducing client impact during leader termination.

Changes:

  • Add a Master shutdown hook to invoke Master.stop(...) on JVM termination.
  • Add HA graceful shutdown config and, when enabled, stop the Ratis server during Master shutdown (including leader transfer).
  • Update HARaftServer.stepDown() to return a boolean and add a system test around leader shutdown behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
master/src/test/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/RatisMasterStatusSystemSuiteJ.java Adds a new HA test that stops the leader Raft server and asserts closure.
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala Registers a JVM shutdown hook and conditionally stops the Raft server in onStop().
master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/ha/HARaftServer.java Enhances stop() to attempt leadership transfer; changes stepDown() to return success/failure.
docs/configuration/ha.md Documents the new HA graceful shutdown config flag.
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Adds the new config entry and accessor for HA graceful shutdown enablement.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
Comment thread docs/configuration/ha.md Outdated
@SteNicholas
Copy link
Copy Markdown
Member

⏺ Review: [CELEBORN-2306] Master shutdown hook and Raft stepdown

PR: #3659 | Author: kaybhutani | Change: +93 / -2 across 5 files

What it does

Adds a JVM shutdown hook to the Celeborn Master that, when celeborn.master.ha.gracefulShutdown.enabled is set, transfers Raft leadership to another node before shutting down. This avoids the Raft election window where no leader is available
and clients see failures.

Key changes:

  • New config: celeborn.master.ha.gracefulShutdown.enabled (default false)
  • HARaftServer.stepDown() now returns boolean
  • HARaftServer.stop() attempts leadership transfer before closing
  • Shutdown hook registered in Master.initialize()
  • New test for leader step-down + close

Issues

  1. (High) Shutdown hook may never trigger Raft transfer

The shutdown hook calls stop(EXIT_IMMEDIATELY) which calls rpcEnv.stop(self), scheduling onStop() asynchronously on the RPC dispatcher. Since dispatcher threads are daemon threads, the JVM can exit before onStop() runs — meaning the
leadership transfer never happens.

Fix: Perform the Raft step-down directly in Master.stop() or the shutdown hook itself, not in the async onStop() callback. Or use a CountDownLatch to block the hook until onStop() completes.

  1. (High) Double-close of Raft server

When graceful shutdown is enabled and the node is leader: Master.onStop() calls ratisServer.stop(), then the normal statusSystem/HAMasterMetaManager shutdown path likely calls it again. This can throw IOException or produce confusing logs.
HARaftServer.stop() should be idempotent (e.g., guard with an AtomicBoolean).

  1. (Medium) Config/doc mismatch

The shutdown hook is registered unconditionally in Master.initialize(), but the config description says "the master will run a shutdown hook." The config actually only gates the Raft transfer inside onStop(). Either gate the hook registration
on the config, or fix the description.

  1. (Medium) Test corrupts shared cluster state

testGracefulLeaderShutdownStepDown calls leader.stop() on a shared static RATISSERVER instance. This permanently closes the leader, breaking any subsequent tests that assume a 3-node cluster. The test should either spin up an isolated
cluster, or restart the server in a finally block.

  1. (Low) Leadership transfer always attempted in stop(), regardless of config

HARaftServer.stop() unconditionally attempts transfer when the node is leader. The config gate in Master.onStop() only controls whether stop() is called early. This is arguably fine (best-effort), but worth documenting.

  1. (Low) No independent timeout on transfer in stop()

The TransferLeadershipRequest has a 60s timeout, but stop() itself has no guard. If the Ratis call hangs, shutdown stalls. Consider wrapping in a Future with a bounded timeout.

  1. (Nit) Duplicate success logging

A successful transfer now logs from both stepDown() ("Successfully step down leader") and stop() ("Successfully transferred leadership"). Remove one.


Verdict

Request changes. The motivation is solid — graceful leadership transfer is valuable for production HA. But the critical issue is that the Raft transfer logic in onStop() may never execute due to daemon thread timing (Issue 1), and
double-close needs a guard (Issue 2). The test also needs isolation (Issue 4).

Suggested fix priority:

  1. Move Raft shutdown into Master.stop() / the hook directly, not onStop()
  2. Make HARaftServer.stop() idempotent
  3. Fix config description
  4. Isolate the test

@SteNicholas
Copy link
Copy Markdown
Member

@kaybhutani, thanks for contribution. Please address above comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@SteNicholas SteNicholas requested a review from zaynt4606 April 22, 2026 03:58
}
} catch (Exception e) {
LOG.warn("Step down leader failed!", e);
return false;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could only return false at the end of this method.

@SteNicholas SteNicholas requested a review from Copilot April 22, 2026 04:02
@SteNicholas SteNicholas changed the title [CELEBORN-2306] Master shutdown hook and Raft stepdown [CELEBORN-2306] Master adds shutdown hook for ratis stepdown Apr 22, 2026
@SteNicholas
Copy link
Copy Markdown
Member

@kaybhutani, thanks for updates. Please run UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite to add document for applying suggestion of celeborn.master.ha.gracefulShutdown.enabled.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala Outdated
@kaybhutani
Copy link
Copy Markdown
Contributor Author

made changes @SteNicholas

@SteNicholas
Copy link
Copy Markdown
Member

@kaybhutani, thanks for update. Please fix the failure of CI that GrpcRatisMasterStatusSystemSuiteJ>RatisMasterStatusSystemSuiteJ.testGracefulLeaderShutdownStepDown:1919 A new leader should be elected after step down.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 648 to +666
@@ -628,13 +657,13 @@ void stepDown() {
REQUEST_TIMEOUT_MS);
RaftClientReply reply = server.transferLeadership(request);
if (reply.isSuccess()) {
LOG.info("Successfully step down leader {}.", server.getId());
} else {
LOG.warn("Step down leader failed!");
return true;
}
LOG.warn("Step down leader {} failed.", server.getId());
} catch (Exception e) {
LOG.warn("Step down leader failed!", e);
LOG.warn("Step down leader {} failed.", server.getId(), e);
}
return false;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already log Successfully transferred leadership from {} in {}ms. on successful transfer, this one is redundant as per me but open to add it if reviewers disagree.

Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
@kaybhutani kaybhutani marked this pull request as draft May 4, 2026 10:50
@kaybhutani kaybhutani marked this pull request as ready for review May 4, 2026 15:03
@kaybhutani kaybhutani requested a review from SteNicholas May 4, 2026 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants