Five nodes each think they might be the leader. The old leader's network cable was unplugged; now it's plugged back in. Who's in charge? Who accepts writes? If two nodes both think they're leader, they both accept writes — split-brain, data corruption, hours of manual reconciliation.
Consensus algorithms (Paxos, Raft, Zab) solve "make N nodes agree on one value, even when some crash or the network flakes." Every serious distributed system uses one. Raft won in the last decade because it's understandable; Paxos is still running under Spanner and Google's internal stack.
02
The FLP result — why consensus is hard
Fischer-Lynch-Paterson (1985): in a purely asynchronous system, consensus is impossible if even one node can crash. You can't distinguish "node is slow" from "node is dead." So any algorithm either gives up liveness (might never decide) or gives up safety (might decide the wrong thing).
Real algorithms pick: Paxos and Raft preserve safety (never decide two different values), sacrifice liveness during pathological partitions (may fail to make progress temporarily). In practice, with reasonable timeouts, they make progress within milliseconds.
03
Raft in one page
Raft elects a leader; the leader replicates a log; followers commit entries when replicated to a majority.
Three roles. Follower (default), Candidate (trying to become leader), Leader (one at a time).
Terms. Time divides into numbered terms. Each term has at most one leader. Terms increase monotonically.
Election. Followers start an election timer (randomized 150–300ms). If no heartbeat from a leader before timeout, become candidate → increment term → vote for self → ask peers. A candidate with votes from a majority of nodes becomes leader.
Log replication. Client sends write to leader. Leader appends to its log, ships to followers. Once majority of followers have persisted, leader commits and tells followers. Followers apply to state machine in order.
Safety guarantee. A committed entry is never lost or overwritten. Even if the leader dies, the new leader will have it (because the old leader replicated to a majority before committing, and the new leader got votes from a majority → the two majorities overlap).
Raft Log ReplicationMermaid
sequenceDiagram
participant C as Client
participant L as Leader
participant F1 as Follower 1
participant F2 as Follower 2
C->>L: write(x=5)
L->>L: append to log (uncommitted)
par Replicate
L->>F1: AppendEntries(x=5)
F1-->>L: ok
and
L->>F2: AppendEntries(x=5)
F2-->>L: ok
end
Note over L: Majority ack → commit
L->>L: apply to state machine
L-->>C: success
L->>F1: AppendEntries(commit-idx=N)
L->>F2: AppendEntries(commit-idx=N)
F1->>F1: apply to state machine
F2->>F2: apply to state machine
Term Progression — Leader Failure & Re-ElectionMermaid
sequenceDiagram
participant L as Leader (term 1)
participant F1 as Follower 1
participant F2 as Follower 2
Note over L,F2: Term 1 — normal commit (~2ms)
L->>F1: heartbeat
L->>F2: heartbeat
F1-->>L: ack
F2-->>L: ack
Note over L: Leader crashes
Note over F1,F2: Election timeout fires (random 150-300ms)
F1->>F1: become Candidate, term=2, vote self
F1->>F2: RequestVote(term=2)
F2-->>F1: voteGranted
Note over F1: Majority (2/3) → Leader of term 2
Note over F1,F2: Term 2 — new leader resumes
F1->>F2: AppendEntries(term=2)
F2-->>F1: ack
Note over F1,F2: Election + recovery: ~300-500ms total
~2 ms
in-region commit (1 RTT to majority)
100-200 ms
cross-region commit
150-300 ms
election timeout (randomized)
~500 ms
total recovery from leader loss
04
Paxos vs Raft vs Zab
Algorithm
Origin
Used in
Notes
Paxos
Lamport, 1989
Chubby, Spanner, etcd (legacy)
Famously hard to understand. Many production variants (Multi-Paxos, EPaxos).
Raft
Ongaro & Ousterhout, 2014
etcd, Consul, CockroachDB, TiKV, RabbitMQ Streams
Designed for understandability. 90% of new systems pick Raft.
Zab
Zookeeper, 2008
Apache Zookeeper
Predates Raft. Very similar structure — leader-based, majority acks.
Viewstamped Replication
Oki & Liskov, 1988
Research, some fault-tolerant systems
Pre-Paxos academic work. Influenced both Paxos and Raft.
05
What consensus costs
Every write needs majority acknowledgment. 5-node cluster → at least 3 must ACK. One slow node is fine; two down and you're stalled.
Write latency = leader + slowest of the majority. Typically 1–5ms in-region; 100+ ms cross-region. Put consensus groups in one region.
Odd-numbered clusters. 3 or 5 nodes, not 4. An even cluster has no majority advantage (3-vs-3 split both sides lose).
Leader load. All writes go through the leader. Leader crashes → election (~150ms) → new leader. During election, the cluster is unavailable for writes.
Not for everything. Consensus is for metadata and critical state — locks, config, leader election, coordination. You don't run consensus on every tweet. Heavy data goes in separate, eventually-consistent stores that depend on the consensus tier for coordination.
06
Deep dive — the majority overlap theorem
The beautiful core insight of Raft (and Paxos): any two majorities of an N-node cluster must share at least one node.
Proof sketch: a majority is ⌈N/2⌉ + 1 nodes. Two disjoint majorities would need N/2+1 + N/2+1 = N+2 nodes, but we only have N. Contradiction.
Why this matters: a committed entry is replicated to a majority. A new leader needs votes from a majority. By overlap, the new leader must have voted with at least one node that has the committed entry → new leader has the entry (it picks the most-up-to-date log among voters). Committed entries survive leader changes.
This is the entire correctness argument for log-based consensus in one paragraph. Memorize it.
Interview answer
"Raft elects a leader via randomized timeouts; the leader replicates the log to followers; entries are committed when a majority persist them. Safety follows from two facts: only a majority can elect a leader, and any two majorities overlap — so a new leader always has every committed entry."
07
Real-world
etcd
Raft-based KV store
Powers Kubernetes. Holds cluster metadata. ~5-node cluster handling tens of thousands of writes/sec.
CockroachDB
Raft per range
Each shard (~64 MB range) has its own Raft group. Millions of concurrent Raft groups across the cluster.
Google Spanner
Paxos + TrueTime
Multi-Paxos for each shard. Combined with TrueTime for globally-consistent transactions.
Kafka KRaft
Raft for metadata
Kafka 3+ replaced Zookeeper with a self-hosted Raft cluster for broker metadata and leader election.
08
Used in problems
Google Docs uses consensus for the operation log (OT resolves in order). Distributed locking relies on consensus for fencing tokens. Distributed job scheduler uses Raft to elect the active scheduler.