Raft leader election does not gate on execution-layer sync: an unsynchronized node can win election and block production indefinitely #3255

@auricom

Description

Summary

A node whose execution layer is significantly behind the raft-committed height can win a raft leader election. Once elected, it cannot produce new blocks until it has replayed the missing entries, creating an extended (potentially permanent) outage. In our case, a node 166,305 blocks behind raft state won leadership and the cluster produced no new blocks for over 3 hours.


Environment

  • Cluster: 3 nodes (node-1 / node-2 / node-3), ev-node-evm, raft consensus enabled
  • Deployment: Docker Compose, one container per node, snapshot-bootstrapped

Observed behaviour

After a cascade of leader failovers, node-2 won election at raft term 192 while its execution layer was 166,305 blocks behind raft state:

# node-2 immediately after being elected leader at 07:40:41 UTC
local state behind raft state, skipping recovery to allow catchup
  component=main  diff=-166305  local_height=128706364  raft_height=128872669

became leader but not synced, attempting recovery  component=main
recovering state from raft  component=syncer  height=128872669

The execution_replayer then reported the same gap every ~60 seconds without making progress:

07:16:43  execution layer is behind, syncing blocks
          component=execution_replayer  blocks_to_sync=166305
          exec_layer_height=128706364  target_height=128872669

07:17:41  execution layer is behind, syncing blocks  blocks_to_sync=166305  …
07:18:40  execution layer is behind, syncing blocks  blocks_to_sync=166305  …
…  (repeated at ~60s intervals, count never decreasing, for 38+ minutes)

The last block ever produced by the cluster was at 06:12:33 UTC at height 128,872,669 — more than 3 hours before the issue was discovered at 09:36.


How the execution layer fell 166,305 blocks behind

Time (UTC)        Event
Apr 14 10:41      Cluster bootstrapped; node-2 wins initial election (term 2), becomes leader
Apr 15 01:33      node-1 wins election (term 3); node-2 becomes follower
Apr 15 06:08:43   node-3 wins election (term 4)
Apr 15 06:12:33   Last block produced (height 128,872,669) by node-3
Apr 15 06:12:39   node-3 crashes (leader lock lost); cluster enters crash loop
Apr 15 07:40:41   node-2 wins term 192; exec layer 166,305 blocks behind raft

Between 01:33 (when node-2 became a follower) and 07:40 (when it won the election), node-2's execution layer stopped processing raft log entries at height 128,706,364 — the height at which it had previously been leader. Raft replication kept its raft log current (raft_height = 128,872,669), but the execution layer never applied those entries while it was a follower.


Why this is a problem

1. Raft election eligibility is not gated on execution-layer sync

Raft's log-matching safety property ensures only nodes with up-to-date raft logs can win elections. However, ev-node decouples raft log state from execution-layer state. A node with a fully replicated raft log but a stale execution layer can win an election even though it cannot immediately produce new blocks.

2. The new leader cannot produce blocks during catch-up

The leader transitions to a "recovering" state and must replay up to hundreds of thousands of blocks before resuming production. During this window the cluster produces no new blocks, potentially for minutes or hours.

3. Catch-up is continuously interrupted, making recovery impossible

In our scenario, an ongoing crash loop on node-1 triggered new elections every ~60s. Each displacement aborted the in-progress catch-up. The blocks_to_sync=166305 count never decremented across 38 minutes of attempts:

07:16:43  blocks_to_sync=166305
07:17:41  blocks_to_sync=166305
…
07:54:36  blocks_to_sync=166305  ← same count, 38 minutes later

4. Follower execution lag is not surfaced in election eligibility

A follower whose execution layer is far behind looks identical to a fully synced follower from the raft election perspective. There is no mechanism to prefer a synced node (node-3, exec_height = 128,872,669) over an unsynced one (node-2, exec_height = 128,706,364) during election.


Expected behaviour

One or more of the following mitigations would address this:

  1. Gate election eligibility on execution-layer sync. A node should withhold its pre-vote / vote until its execution layer is within a configurable threshold of raft height (e.g. max_exec_lag_blocks). This prevents an unsynchronized node from winning an election.

  2. Defer leader operations until catch-up completes. If a node does become leader while unsynced, starting leader operations and block production should be deferred until exec_layer_height >= raft_height. The current code logs "became leader but not synced, attempting recovery" and then immediately begins leader operations anyway.

  3. Bound follower execution lag. If a follower's execution layer falls more than N blocks behind its raft log, it should either stop voting (self-demote) or emit a high-severity metric/alert. Currently the drift is silent and only becomes visible after a failover.


Key log evidence

node-2 winning term 192 while 166,305 blocks behind (07:40 UTC):

2026-04-15T07:40:41  raft: election won: term=192 tally=2
2026-04-15T07:40:41  raft: entering leader state
2026-04-15T07:40:18  local state behind raft state, skipping recovery to allow catchup
                     diff=-166305  local_height=128706364  raft_height=128872669
2026-04-15T07:40:41  became leader but not synced, attempting recovery
2026-04-15T07:40:41  recovering state from raft  height=128872669

node-1 (same scenario, 2,151 blocks behind) briefly winning election during crash loop:

2026-04-15T07:40:10  local state behind raft state, skipping recovery to allow catchup
                     diff=-2151  local_height=128870518  raft_height=128872669
2026-04-15T07:40:10  became leader but not synced, attempting recovery

Execution replayer stuck at same count for 38 minutes:

2026-04-15T07:16:43  blocks_to_sync=166305  exec_layer_height=128706364  target_height=128872669
2026-04-15T07:17:41  blocks_to_sync=166305
2026-04-15T07:18:40  blocks_to_sync=166305
…
2026-04-15T07:54:36  blocks_to_sync=166305

Related

  • Raft leader re-election takes up to 90s after SIGTERM on a 3-node cluster #3229 — Slow re-election after SIGTERM (related: the ongoing crash loop that caused the divergence in the first place)
  • leader lock lost fatal crash — a node losing leadership exits the process instead of gracefully stepping down to follower. This is a separate but contributing issue: it is what triggered the repeated unsynchronized elections described here and what prevented recovery once node-2 was elected with a stale execution layer.
