Consensus — Paxos & Raft — Concept — System Design Portfolio

01

Why this matters

Five nodes each think they might be the leader. The old leader's network cable was unplugged; now it's plugged back in. Who's in charge? Who accepts writes? If two nodes both think they're leader, they both accept writes — split-brain, data corruption, hours of manual reconciliation.

Consensus algorithms (Paxos, Raft, Zab) solve "make N nodes agree on one value, even when some crash or the network flakes." Every serious distributed system uses one. Raft won in the last decade because it's understandable; Paxos is still running under Spanner and Google's internal stack.

02

The FLP result — why consensus is hard

Fischer-Lynch-Paterson (1985): in a purely asynchronous system, consensus is impossible if even one node can crash. You can't distinguish "node is slow" from "node is dead." So any algorithm either gives up liveness (might never decide) or gives up safety (might decide the wrong thing).

Real algorithms pick: Paxos and Raft preserve safety (never decide two different values), sacrifice liveness during pathological partitions (may fail to make progress temporarily). In practice, with reasonable timeouts, they make progress within milliseconds.

03

Raft in one page

Raft elects a leader; the leader replicates a log; followers commit entries when replicated to a majority.

Three roles. Follower (default), Candidate (trying to become leader), Leader (one at a time).

Terms. Time divides into numbered terms. Each term has at most one leader. Terms increase monotonically.

Election. Followers start an election timer (randomized 150–300ms). If no heartbeat from a leader before timeout, become candidate → increment term → vote for self → ask peers. A candidate with votes from a majority of nodes becomes leader.

Log replication. Client sends write to leader. Leader appends to its log, ships to followers. Once majority of followers have persisted, leader commits and tells followers. Followers apply to state machine in order.

Safety guarantee. A committed entry is never lost or overwritten. Even if the leader dies, the new leader will have it (because the old leader replicated to a majority before committing, and the new leader got votes from a majority → the two majorities overlap).

Raft Log ReplicationMermaid

sequenceDiagram participant C as Client participant L as Leader participant F1 as Follower 1 participant F2 as Follower 2 C->>L: write(x=5) L->>L: append to log (uncommitted) par Replicate L->>F1: AppendEntries(x=5) F1-->>L: ok and L->>F2: AppendEntries(x=5) F2-->>L: ok end Note over L: Majority ack → commit L->>L: apply to state machine L-->>C: success L->>F1: AppendEntries(commit-idx=N) L->>F2: AppendEntries(commit-idx=N) F1->>F1: apply to state machine F2->>F2: apply to state machine

Term Progression — Leader Failure & Re-Election Mermaid

sequenceDiagram participant L as Leader (term 1) participant F1 as Follower 1 participant F2 as Follower 2 Note over L,F2: Term 1 — normal commit (~2ms) L->>F1: heartbeat L->>F2: heartbeat F1-->>L: ack F2-->>L: ack Note over L: Leader crashes Note over F1,F2: Election timeout fires (random 150-300ms) F1->>F1: become Candidate, term=2, vote self F1->>F2: RequestVote(term=2) F2-->>F1: voteGranted Note over F1: Majority (2/3) → Leader of term 2 Note over F1,F2: Term 2 — new leader resumes F1->>F2: AppendEntries(term=2) F2-->>F1: ack Note over F1,F2: Election + recovery: ~300-500ms total

~2 ms

in-region commit (1 RTT to majority)

100-200 ms

cross-region commit

150-300 ms

election timeout (randomized)

~500 ms

total recovery from leader loss

04

Paxos vs Raft vs Zab

Algorithm	Origin	Used in	Notes
Paxos	Lamport, 1989	Chubby, Spanner, etcd (legacy)	Famously hard to understand. Many production variants (Multi-Paxos, EPaxos).
Raft	Ongaro & Ousterhout, 2014	etcd, Consul, CockroachDB, TiKV, RabbitMQ Streams	Designed for understandability. 90% of new systems pick Raft.
Zab	Zookeeper, 2008	Apache Zookeeper	Predates Raft. Very similar structure — leader-based, majority acks.
Viewstamped Replication	Oki & Liskov, 1988	Research, some fault-tolerant systems	Pre-Paxos academic work. Influenced both Paxos and Raft.

05

What consensus costs

Every write needs majority acknowledgment. 5-node cluster → at least 3 must ACK. One slow node is fine; two down and you're stalled.
Write latency = leader + slowest of the majority. Typically 1–5ms in-region; 100+ ms cross-region. Put consensus groups in one region.
Odd-numbered clusters. 3 or 5 nodes, not 4. An even cluster has no majority advantage (3-vs-3 split both sides lose).
Leader load. All writes go through the leader. Leader crashes → election (~150ms) → new leader. During election, the cluster is unavailable for writes.
Not for everything. Consensus is for metadata and critical state — locks, config, leader election, coordination. You don't run consensus on every tweet. Heavy data goes in separate, eventually-consistent stores that depend on the consensus tier for coordination.

06

Deep dive — the majority overlap theorem

The beautiful core insight of Raft (and Paxos): any two majorities of an N-node cluster must share at least one node.

Proof sketch: a majority is ⌈N/2⌉ + 1 nodes. Two disjoint majorities would need N/2+1 + N/2+1 = N+2 nodes, but we only have N. Contradiction.

Why this matters: a committed entry is replicated to a majority. A new leader needs votes from a majority. By overlap, the new leader must have voted with at least one node that has the committed entry → new leader has the entry (it picks the most-up-to-date log among voters). Committed entries survive leader changes.

This is the entire correctness argument for log-based consensus in one paragraph. Memorize it.

Interview answer

"Raft elects a leader via randomized timeouts; the leader replicates the log to followers; entries are committed when a majority persist them. Safety follows from two facts: only a majority can elect a leader, and any two majorities overlap — so a new leader always has every committed entry."

07

Real-world

etcd

Raft-based KV store

Powers Kubernetes. Holds cluster metadata. ~5-node cluster handling tens of thousands of writes/sec.

CockroachDB

Raft per range

Each shard (~64 MB range) has its own Raft group. Millions of concurrent Raft groups across the cluster.

Google Spanner

Paxos + TrueTime

Multi-Paxos for each shard. Combined with TrueTime for globally-consistent transactions.

Kafka KRaft

Raft for metadata

Kafka 3+ replaced Zookeeper with a self-hosted Raft cluster for broker metadata and leader election.

08

Used in problems

Google Docs uses consensus for the operation log (OT resolves in order). Distributed locking relies on consensus for fencing tokens. Distributed job scheduler uses Raft to elect the active scheduler.

📺

References & Videos

Raft Consensus Algorithm

Martin Kleppmann · 20 min

Paxos vs Raft — Consensus Explained

Arpit Bhayani · 25 min

Consensus in Distributed Systems

Gaurav Sen · 15 min

The Raft Consensus Algorithm

raft.github.io

Distributed Locking & Consensus

Martin Kleppmann

Consensus — Paxos & Raft