Skip to content

RATIS-2427. Close LogAppender for illegalStateException#1369

Merged
szetszwo merged 3 commits intoapache:masterfrom
symious:RATIS-2427
Mar 12, 2026
Merged

RATIS-2427. Close LogAppender for illegalStateException#1369
szetszwo merged 3 commits intoapache:masterfrom
symious:RATIS-2427

Conversation

@symious
Copy link
Contributor

@symious symious commented Mar 9, 2026

What changes were proposed in this pull request?

We met the following exception.

2026-03-08 13:51:10,388 [91193815-8c02-4e50-8cdc-792d6a5c1cdd@group-55811D60EE83->8e3e9d6b-d6d2-4ee4-b6ac-b310cf7117f1-GrpcLogAppender-LogAppenderDaemon] WARN org.apache.ratis.server.leader.LogAppenderDaemon: 91193815-8c02-4e50-8cdc-792d6a5c1cdd@group-55811D60EE83->8e3e9d6b-d6d2-4ee4-b6ac-b310cf7117f1-GrpcLogAppender-LogAppenderDaemon failed
java.lang.IllegalStateException: 91193815-8c02-4e50-8cdc-792d6a5c1cdd@group-55811D60EE83-SegmentedRaftLog is expected to be opened but it is CLOSED
        at org.apache.ratis.util.OpenCloseState.assertOpen(OpenCloseState.java:63)
        at org.apache.ratis.server.raftlog.RaftLogBase.checkLogState(RaftLogBase.java:112)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getLastEntryTermIndex(SegmentedRaftLog.java:356)
        at org.apache.ratis.server.raftlog.RaftLog.getNextIndex(RaftLog.java:83)
        at org.apache.ratis.grpc.server.GrpcLogAppender.shouldNotifyToInstallSnapshot(GrpcLogAppender.java:826)
        at org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:247)
        at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:261)
        at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.util.OpenCloseState$CloseTrace: Close 91193815-8c02-4e50-8cdc-792d6a5c1cdd@group-55811D60EE83-SegmentedRaftLog
        at org.apache.ratis.util.OpenCloseState.lambda$close$1(OpenCloseState.java:109)
        at java.util.concurrent.atomic.AtomicReference.getAndUpdate(AtomicReference.java:160)
        at org.apache.ratis.util.OpenCloseState.close(OpenCloseState.java:109)
        at org.apache.ratis.server.raftlog.RaftLogBase.close(RaftLogBase.java:386)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.close(SegmentedRaftLog.java:528)
        at org.apache.ratis.server.raftlog.RaftLogBase.lambda$appendImpl$3(RaftLogBase.java:203)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog$Task.failed(SegmentedRaftLog.java:109)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$WriteLog.failed(SegmentedRaftLogWorker.java:527)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:324)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more 

And this ticket is to skip the restart for this case, all it will restart forever.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/RATIS-2427

How was this patch tested?

Test locally.

@szetszwo
Copy link
Contributor

szetszwo commented Mar 9, 2026

... it will restart forever.

When the log has been closed, the server should shut down. So, how could it restart forever?

@symious
Copy link
Contributor Author

symious commented Mar 10, 2026

LOG.info(this + " was interrupted: " + e);
} catch (InterruptedIOException e) {
LOG.info(this + " I/O was interrupted: " + e);
} catch (IllegalStateException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@symious , you are right that we should not restart LogAppender.

Instead of catching IllegalStateException. Let's check if the RaftLog is open. Also, LogAppender isRunning() should return false.

diff --git a/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java b/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
index e504462b8..e194f865e 100644
--- a/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
+++ b/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
@@ -43,6 +43,9 @@ public interface RaftLog extends RaftLogSequentialOps, Closeable {
   /** Invalid log index is used to indicate that the log index is missing. */
   long INVALID_LOG_INDEX = LEAST_VALID_LOG_INDEX - 1;
 
+  /** Is this log already opened but not yet closed? */
+  boolean isOpened();
+
   /** Does this log contains the given {@link TermIndex}? */
   default boolean contains(TermIndex ti) {
     Objects.requireNonNull(ti, "ti == null");
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
index 5a27cda51..be0404da3 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
@@ -124,7 +124,10 @@ public abstract class LogAppenderBase implements LogAppender {
 
   @Override
   public boolean isRunning() {
-    return daemon.isWorking() && server.getInfo().isLeader();
+    return daemon.isWorking()
+        && server.getInfo().isAlive()
+        && server.getInfo().isLeader()
+        && getRaftLog().isOpened();
   }
 
   @Override
@@ -133,8 +136,8 @@ public abstract class LogAppenderBase implements LogAppender {
   }
 
   void restart() {
-    if (!server.getInfo().isAlive()) {
-      LOG.warn("Failed to restart {}: server {} is not alive", this, server.getMemberId());
+    if (!isRunning()) {
+      LOG.warn("{} is not running: skipping restart", this);
       return;
     }
     getLeaderState().restart(this);
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
index 8c2b66f96..48b410147 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
@@ -113,7 +113,7 @@ public abstract class RaftLogBase implements RaftLog {
     state.assertOpen();
   }
 
-  /** Is this log already opened? */
+  @Override
   public boolean isOpened() {
     return state.isOpened();
   }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, PTAL.

Copy link
Contributor

@ivandika3 ivandika3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM +1. Each restart create a new GrpcLogAppender which creates resources on both the leader (gRPC client) and follower (gRPC server). If there are a lot of restarts (we saw few millions of them in a span of few days), this can cause memory issues (+ we suspect gRPC resource leak, addressed in RATIS-2426).

When the log has been closed, the server should shut down. So, how could it restart forever?

I'm still not sure about this though. We might need to investigate why the log is closed, but the server was not shutdown. Edit: Seems WriteLog failure in RaftLogBase#appendImpl causes the RaftLogBase#close be closed, but not the server.

Comment on lines 126 to 131
public boolean isRunning() {
return daemon.isWorking() && server.getInfo().isLeader();
return daemon.isWorking()
&& server.getInfo().isAlive()
&& server.getInfo().isLeader()
&& getRaftLog().isOpened();
}
Copy link
Contributor

@ivandika3 ivandika3 Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we take into account the Thread.currentThread().isInterrupted()?

Saw the following InterruptedIOException from GrpcLogAppender#sleep which should set the current interrupt flag, but seems it's never checked

3e6753e7-cf24-4627-a99a-d02af1901b68@group-7B917680D70D->91193815-8c02-4e50-8cdc-792d6a5c1cdd-GrpcLogAppender-LogAppenderDaemon I/O was interrupted: java.io.InterruptedIOException: Interrupted appendLog, heartbeat? false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread.currentThread().isInterrupted() is a runtime state but not an object state. So, it probably should not be in LogAppenderBase.isRunning().

Hypothetically, suppose we have a CLI to ask if a LogAppender is running. The CLI processing thread will call LogAppender.isRunning(). However, LogAppender.isRunning() has nothing to do with whether the CLI processing thread is interrupted.

Saw the following InterruptedIOException from GrpcLogAppender#sleep which should set the current interrupt flag, but seems it's never checked ...

The InterruptedIOException should be caught in LogAppenderDaemon.run(). Isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're correct, it should be checked in the daemon itself, not in API that can be used in other threads.

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@szetszwo szetszwo merged commit 2c5412c into apache:master Mar 12, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants