Concept · Distributed Systems

Leader Election

01

Why this matters

Distributed systems are full of "only one at a time" jobs: one cron runner, one database primary, one billing reconciler, one job scheduler. Without coordination, every replica does the same work — duplicate emails, duplicate charges, conflicting writes. Leader election picks exactly one node (the "leader") and the rest defer to it. If the leader dies, a new one is elected quickly.

This sounds easy. It's not. The failure mode — two nodes both think they're leader (split-brain) — is one of the most dangerous bugs in distributed systems.

02

The naive version and why it fails

Naive approach: every node holds a TTL lock in Redis/etcd ("I'm leader"). The holder renews every 5 seconds. If the lock expires, another node grabs it.

This breaks under GC pause. Leader A is holding the lock. A stops-the-world for 15 seconds (long GC, paging, CPU steal from a noisy neighbor). Lock expires. B acquires it, does leader work. A wakes up, still believes "I'm leader," continues writing. Two writers. Corruption.

So leader election can't just be "I hold a lock." It must be verifiable by anyone who reads from the leader's output. Enter fencing tokens (see distributed locking): each leader gets a monotonically-increasing token; downstream systems reject writes tagged with a stale token.

03

The mainstream approaches

1. Consensus-backed (Raft/Paxos). Nodes run a consensus protocol internally. The elected leader of the consensus group is the leader. This is what etcd, Zookeeper, and Consul provide. Downsides: consensus is expensive to run (~5 nodes, majority acks), overkill for simple "pick one cron runner."

2. Lock + fencing token. Store the lock and a monotonic counter in one atomic DB. Every acquire increments the counter. Downstream systems (DBs, APIs) reject stale tokens. Works great if your downstreams can enforce the check.

3. DB row with heartbeat. A leaders table: (resource_id, node_id, expires_at). Each node tries to insert its row with ON CONFLICT DO NOTHING (or equivalent). Winner updates expires_at every few seconds. Losers poll. Simple; works in Postgres.

4. Sticky by resource. Use consistent hashing to map each resource to a node deterministically. "Resource foo always handled by the node whose ID is the clockwise neighbor of hash(foo)." Implicit leader. Rebalances on node add/remove.

Raft leader election (pseudocode)
# Raft: followers → candidate → leader.
# Random election timeout (150–300ms) prevents split votes.

import random

class RaftNode:
    def __init__(self, node_id, peers):
        self.id = node_id
        self.peers = peers
        self.term = 0
        self.voted_for = None
        self.state = "follower"
        self.timeout = random.uniform(0.15, 0.30)

    def on_timeout(self):
        # Followers promote to candidate when leader silent too long
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.id
        votes = 1
        for p in self.peers:
            if p.request_vote(self.term, self.id): votes += 1
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"
            self.start_heartbeats()
        else:
            self.state = "follower"  # lost or split

    def request_vote(self, term, candidate_id):
        if term > self.term and self.voted_for is None:
            self.term, self.voted_for = term, candidate_id
            return True
        return False
04

Pick by what you already have

You already runUseWhy
KubernetesK8s lease objectsBuilt into K8s API. Client-go provides leaderelection package.
ZookeeperEphemeral sequential znodesClassic pattern — create ephemeral node; lowest sequence = leader. Zab handles correctness.
etcdConcurrency APIsBuilt-in leader election primitive (v3 concurrency API).
PostgresRow insert with leaseNo extra infra. Good enough for most cron-style workloads.
RedisSET NX + RedlockFast but best-effort. Not safe for correctness-critical leadership.
05

Deep dive — the heartbeat loop done right

Leader election requires a liveness signal: the leader heartbeats ("I'm alive"); if followers stop hearing it, they elect a new leader. The heartbeat loop has several parameters that all interact:

  • Heartbeat interval — how often leader sends. Typical: 100–500ms.
  • Heartbeat timeout — how long before a follower decides "leader is dead." Typical: 2–5× heartbeat interval (1–5s).
  • Election timeout — randomized window within which a follower will trigger an election. Typical: 150–300ms.
  • Lease duration — how long the leader's authority lasts before it must renew. Must be > heartbeat timeout.

The nightmare: if heartbeat timeout is too tight, normal network jitter causes spurious elections. Too loose, and a dead leader leaves the cluster leaderless for seconds. Raft's innovation was randomized timeouts — each follower picks a random election timeout in a range, making simultaneous bids (split votes) rare.

Tuning rule

Heartbeat interval = 100ms. Election timeout = random(150, 300)ms. Lease = 1s. Under these, a dead leader is replaced in ~300ms, and split votes are rare because each candidate's random timeout is different.

06

Real-world

Kubernetes controllers

Lease-based

Every controller uses k8s lease object for leader election. Scheduler, controller-manager, custom operators — all the same pattern.

Kafka controller

Zookeeper / KRaft

One broker is the "controller" that manages partition assignments. Elected via Zookeeper (old) or KRaft (new Raft-based).

Postgres failover

Patroni + etcd

Patroni coordinates Postgres replicas via etcd leader election. Primary dies → etcd lease expires → standby with latest WAL wins.

Distributed cron

Quartz clustered, Temporal

Quartz uses DB row locks for job-level leadership. Temporal uses its own cluster for durable workflow leadership.

07

Used in problems

Distributed job scheduler elects a leader per task partition. Distributed locking relies on leader election underneath. Stock exchange uses leader election for the matching engine (exactly one active matcher per symbol).

Next up