RATIS-2427. Close LogAppender for illegalStateException#1369
RATIS-2427. Close LogAppender for illegalStateException#1369szetszwo merged 3 commits intoapache:masterfrom
Conversation
When the log has been closed, the server should shut down. So, how could it restart forever? |
|
@szetszwo According to https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L326, the SegmentedRaftLogWorker will continue after closing the log. |
| LOG.info(this + " was interrupted: " + e); | ||
| } catch (InterruptedIOException e) { | ||
| LOG.info(this + " I/O was interrupted: " + e); | ||
| } catch (IllegalStateException e) { |
There was a problem hiding this comment.
@symious , you are right that we should not restart LogAppender.
Instead of catching IllegalStateException. Let's check if the RaftLog is open. Also, LogAppender isRunning() should return false.
diff --git a/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java b/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
index e504462b8..e194f865e 100644
--- a/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
+++ b/ratis-server-api/src/main/java/org/apache/ratis/server/raftlog/RaftLog.java
@@ -43,6 +43,9 @@ public interface RaftLog extends RaftLogSequentialOps, Closeable {
/** Invalid log index is used to indicate that the log index is missing. */
long INVALID_LOG_INDEX = LEAST_VALID_LOG_INDEX - 1;
+ /** Is this log already opened but not yet closed? */
+ boolean isOpened();
+
/** Does this log contains the given {@link TermIndex}? */
default boolean contains(TermIndex ti) {
Objects.requireNonNull(ti, "ti == null");
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
index 5a27cda51..be0404da3 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderBase.java
@@ -124,7 +124,10 @@ public abstract class LogAppenderBase implements LogAppender {
@Override
public boolean isRunning() {
- return daemon.isWorking() && server.getInfo().isLeader();
+ return daemon.isWorking()
+ && server.getInfo().isAlive()
+ && server.getInfo().isLeader()
+ && getRaftLog().isOpened();
}
@Override
@@ -133,8 +136,8 @@ public abstract class LogAppenderBase implements LogAppender {
}
void restart() {
- if (!server.getInfo().isAlive()) {
- LOG.warn("Failed to restart {}: server {} is not alive", this, server.getMemberId());
+ if (!isRunning()) {
+ LOG.warn("{} is not running: skipping restart", this);
return;
}
getLeaderState().restart(this);
diff --git a/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
index 8c2b66f96..48b410147 100644
--- a/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
+++ b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/RaftLogBase.java
@@ -113,7 +113,7 @@ public abstract class RaftLogBase implements RaftLog {
state.assertOpen();
}
- /** Is this log already opened? */
+ @Override
public boolean isOpened() {
return state.isOpened();
}There was a problem hiding this comment.
LGTM +1. Each restart create a new GrpcLogAppender which creates resources on both the leader (gRPC client) and follower (gRPC server). If there are a lot of restarts (we saw few millions of them in a span of few days), this can cause memory issues (+ we suspect gRPC resource leak, addressed in RATIS-2426).
When the log has been closed, the server should shut down. So, how could it restart forever?
I'm still not sure about this though. We might need to investigate why the log is closed, but the server was not shutdown. Edit: Seems WriteLog failure in RaftLogBase#appendImpl causes the RaftLogBase#close be closed, but not the server.
| public boolean isRunning() { | ||
| return daemon.isWorking() && server.getInfo().isLeader(); | ||
| return daemon.isWorking() | ||
| && server.getInfo().isAlive() | ||
| && server.getInfo().isLeader() | ||
| && getRaftLog().isOpened(); | ||
| } |
There was a problem hiding this comment.
Should we take into account the Thread.currentThread().isInterrupted()?
Saw the following InterruptedIOException from GrpcLogAppender#sleep which should set the current interrupt flag, but seems it's never checked
3e6753e7-cf24-4627-a99a-d02af1901b68@group-7B917680D70D->91193815-8c02-4e50-8cdc-792d6a5c1cdd-GrpcLogAppender-LogAppenderDaemon I/O was interrupted: java.io.InterruptedIOException: Interrupted appendLog, heartbeat? false
There was a problem hiding this comment.
Thread.currentThread().isInterrupted() is a runtime state but not an object state. So, it probably should not be in LogAppenderBase.isRunning().
Hypothetically, suppose we have a CLI to ask if a LogAppender is running. The CLI processing thread will call LogAppender.isRunning(). However, LogAppender.isRunning() has nothing to do with whether the CLI processing thread is interrupted.
Saw the following InterruptedIOException from GrpcLogAppender#sleep which should set the current interrupt flag, but seems it's never checked ...
The InterruptedIOException should be caught in LogAppenderDaemon.run(). Isn't it?
There was a problem hiding this comment.
Yes, you're correct, it should be checked in the daemon itself, not in API that can be used in other threads.
szetszwo
left a comment
There was a problem hiding this comment.
+1 the change looks good.
What changes were proposed in this pull request?
We met the following exception.
And this ticket is to skip the restart for this case, all it will restart forever.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/RATIS-2427
How was this patch tested?
Test locally.