Summary
A node whose execution layer is significantly behind the raft-committed height can win a raft leader election. Once elected, it cannot produce new blocks until it has replayed the missing entries, creating an extended (potentially permanent) outage. In our case, a node 166,305 blocks behind raft state won leadership and the cluster produced no new blocks for over 3 hours.
Environment
3 nodes (node-1/node-2/node-3), ev-node-evm, raft consensus enabled. Deployment: Docker Compose, one container per node, snapshot-bootstrapped.
Observed behaviour
After a cascade of leader failovers, node-2 won election at raft term 192 while its execution layer was 166,305 blocks behind raft state:
# node-2 immediately after being elected leader at 07:40:41 UTC
local state behind raft state, skipping recovery to allow catchup
component=main diff=-166305 local_height=128706364 raft_height=128872669
became leader but not synced, attempting recovery component=main
recovering state from raft component=syncer height=128872669
The execution_replayer reported the same gap every ~60 seconds without making progress:
07:16:43 execution layer is behind, syncing blocks
component=execution_replayer blocks_to_sync=166305
exec_layer_height=128706364 target_height=128872669
07:17:41 execution layer is behind, syncing blocks blocks_to_sync=166305 …
07:18:40 execution layer is behind, syncing blocks blocks_to_sync=166305 …
… (repeated at ~60s intervals, count never decreasing, for 38+ minutes)
The cluster's last block was produced at 06:12:33 UTC at height 128,872,669 — more than 3 hours before the issue was discovered at 09:36 UTC.
How the execution layer fell 166,305 blocks behind
Between 01:33 (when node-2 became a follower) and 07:40 (when it won the election), node-2's execution layer stopped processing raft log entries at height 128,706,364 — the height at which it had previously been leader. Raft replication kept its raft log current (raft_height = 128,872,669), but the execution layer never applied those entries while it was a follower.
Why this is a problem
1. Raft election eligibility is not gated on execution-layer sync
Raft's election restriction ensures only nodes with up-to-date raft logs can win elections. However, ev-node decouples raft log state from execution-layer state: a node with a fully replicated raft log but a stale execution layer can win an election even though it cannot immediately produce new blocks.
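A minimal sketch of how such a gate could look: before starting (or participating in) an election, a node checks its own execution lag against a configurable threshold. All names (eligibleForElection, maxExecLag) are illustrative assumptions, not ev-node's actual API.

```go
package main

import "fmt"

// eligibleForElection is a hypothetical gate: a node withholds its
// pre-vote/candidacy unless its execution layer is within maxExecLag
// blocks of its raft log height. Names are illustrative, not ev-node's API.
func eligibleForElection(execHeight, raftHeight, maxExecLag uint64) bool {
	if execHeight >= raftHeight {
		return true // fully applied (or ahead): always eligible
	}
	return raftHeight-execHeight <= maxExecLag
}

func main() {
	// node-2 at election time: 166,305 blocks behind -> ineligible.
	fmt.Println(eligibleForElection(128706364, 128872669, 1000)) // false
	// node-3: fully synced -> eligible.
	fmt.Println(eligibleForElection(128872669, 128872669, 1000)) // true
}
```

With such a check, node-3 (fully synced) would have been the only eligible candidate in the scenario above.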
2. The new leader cannot produce blocks during catch-up
The leader transitions to a "recovering" state and must replay up to hundreds of thousands of blocks before resuming production. During this window, which can last minutes or hours, the cluster produces no new blocks.
3. Catch-up is continuously interrupted, making recovery impossible
In our scenario, an ongoing crash loop on node-1 triggered new elections every ~60s. Each displacement aborted the in-progress catch-up. The blocks_to_sync=166305 count never decremented across 38 minutes of attempts:
07:16:43 blocks_to_sync=166305
07:17:41 blocks_to_sync=166305
…
07:54:36 blocks_to_sync=166305 ← same count, 38 minutes later
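The never-decreasing counter suggests each aborted attempt restarted replay from the pre-election height. A hedged sketch of one possible fix: checkpoint progress after every applied entry so that an interrupted catch-up resumes where it stopped. The function and hook names here are hypothetical, not ev-node's implementation.

```go
package main

import "fmt"

// replayUpTo applies raft entries (from, target] to the execution layer,
// checkpointing after each entry so an interrupted catch-up resumes from
// its last applied height rather than restarting. Hypothetical sketch;
// applyEntry/saveHeight/interrupted stand in for real ev-node hooks.
func replayUpTo(from, target uint64,
	applyEntry func(uint64) error,
	saveHeight func(uint64),
	interrupted func() bool) (uint64, error) {

	h := from
	for h < target {
		if interrupted() { // e.g. displaced by a new election
			return h, nil // progress survives for the next attempt
		}
		if err := applyEntry(h + 1); err != nil {
			return h, err
		}
		h++
		saveHeight(h) // checkpoint: blocks_to_sync shrinks monotonically
	}
	return h, nil
}

func main() {
	applied := 0
	// First attempt is interrupted after 3 entries; progress is kept.
	h, _ := replayUpTo(100, 110,
		func(uint64) error { applied++; return nil },
		func(uint64) {},
		func() bool { return applied >= 3 })
	fmt.Println(h) // 103
	// Second attempt resumes from the checkpoint and completes.
	h, _ = replayUpTo(h, 110,
		func(uint64) error { applied++; return nil },
		func(uint64) {},
		func() bool { return false })
	fmt.Println(h) // 110
}
```

Under this scheme, repeated ~60s displacements would still slow catch-up, but the count could not stay pinned at 166,305 for 38 minutes.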
4. Follower execution lag is not surfaced in election eligibility
A follower whose execution layer has fallen far behind looks identical to a fully synced follower from the raft election perspective. There is no mechanism to prefer a synced node (node-3, exec_height = 128,872,669) over an unsynced one (node-2, exec_height = 128,706,364) during an election.
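One way to surface the drift, sketched under assumed names and thresholds: a periodic follower-side check that classifies execution lag and either emits a high-severity alert or self-demotes (stops granting votes) once configurable bounds are crossed.

```go
package main

import "fmt"

// lagAction classifies a follower's execution lag. Hypothetical sketch;
// the thresholds and names are illustrative, not ev-node configuration.
type lagAction int

const (
	lagOK     lagAction = iota // within tolerance
	lagAlert                   // emit a high-severity metric/alert
	lagDemote                  // stop granting votes until caught up
)

func checkFollowerLag(execHeight, raftHeight, alertAt, demoteAt uint64) lagAction {
	var lag uint64
	if raftHeight > execHeight {
		lag = raftHeight - execHeight
	}
	switch {
	case lag >= demoteAt:
		return lagDemote
	case lag >= alertAt:
		return lagAlert
	default:
		return lagOK
	}
}

func main() {
	// node-2's drift would have crossed both thresholds long before 07:40.
	fmt.Println(checkFollowerLag(128706364, 128872669, 100, 10000) == lagDemote) // true
	fmt.Println(checkFollowerLag(128872669, 128872669, 100, 10000) == lagOK)     // true
}
```

Run on a timer, this would have flagged node-2's six-hour drift while it was still a follower, instead of the gap first becoming visible at failover.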
Expected behaviour
One or more of the following mitigations would address this:
Gate election eligibility on execution-layer sync. A node should withhold its pre-vote / vote until its execution layer is within a configurable threshold of raft height (e.g. max_exec_lag_blocks). This prevents an unsynchronized node from winning an election.
Defer leader operations until catch-up completes. If a node does become leader while unsynced, leader operations and block production should be deferred until exec_layer_height >= raft_height. The current code logs "became leader but not synced, attempting recovery" and then immediately begins leader operations anyway.
Bound follower execution lag. If a follower's execution layer falls more than N blocks behind its raft log, it should either stop voting (self-demote) or emit a high-severity metric/alert. Currently the drift is silent and only becomes visible after a failover.
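The second mitigation might look like the following hedged sketch: after winning an election, the node polls until its execution layer reaches the raft height, or abandons the wait if leadership is lost first. Function and parameter names are assumptions; the real hooks would come from ev-node.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntilSynced blocks until execHeight() >= raftHeight(), polling at the
// given interval, and returns false if leadership is lost first (done is
// closed). Hypothetical sketch of gating block production on catch-up.
func waitUntilSynced(execHeight, raftHeight func() uint64,
	poll time.Duration, done <-chan struct{}) bool {

	for {
		if execHeight() >= raftHeight() {
			return true // safe to start producing blocks
		}
		select {
		case <-done: // displaced before catching up
			return false
		case <-time.After(poll):
		}
	}
}

func main() {
	h := uint64(128872665)
	synced := waitUntilSynced(
		func() uint64 { h++; return h },    // replayer making progress
		func() uint64 { return 128872669 }, // raft height at election
		time.Millisecond, nil)
	fmt.Println(synced) // true once the replayer reaches raft height
}
```

The key design point is that the gate sits between "entering leader state" and "starting leader operations", so an unsynced winner holds the term but produces nothing invalid while it replays.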
Key log evidence
node-2 winning term 192 while 166,305 blocks behind (07:40 UTC):
2026-04-15T07:40:41 raft: election won: term=192 tally=2
2026-04-15T07:40:41 raft: entering leader state
2026-04-15T07:40:18 local state behind raft state, skipping recovery to allow catchup
diff=-166305 local_height=128706364 raft_height=128872669
2026-04-15T07:40:41 became leader but not synced, attempting recovery
2026-04-15T07:40:41 recovering state from raft height=128872669
node-1 (same scenario, 2,151 blocks behind) briefly winning election during the crash loop:
2026-04-15T07:40:10 local state behind raft state, skipping recovery to allow catchup
diff=-2151 local_height=128870518 raft_height=128872669
2026-04-15T07:40:10 became leader but not synced, attempting recovery
Execution replayer stuck at the same count for 38 minutes: see the blocks_to_sync=166305 excerpt above (07:16:43 through 07:54:36, count unchanged).
Related
"leader lock lost" fatal crash — a node losing leadership exits the process instead of gracefully stepping down to follower. This is a separate but contributing issue: it is what triggered the repeated unsynchronized elections described here and what prevented recovery once node-2 was elected with a stale execution layer.