Concept · Distributed Systems

Heartbeat & Failure Detection

01

Why this matters

Is node 7 dead, or is its network link just slow? You can't tell from a missed response. Decide too quickly → false-positive, you kick a healthy node. Decide too slowly → real failures go unnoticed, requests time out for minutes. Every distributed system has to answer this question constantly, for hundreds of nodes. The answer is a failure-detection algorithm, driven by heartbeats.

Get it wrong and you thrash: healthy nodes marked dead, re-joined, re-dead, re-joined. Every replicated DB, LB, and orchestrator runs some variant of this machinery.

02

The simple model — timeout-based heartbeats

Each node periodically sends a heartbeat ("I'm alive") to its peers. If a peer hasn't heard from a node in T seconds → mark it suspect. T more seconds without heartbeat → mark it dead.

Two parameters rule everything:

  • Heartbeat interval (100ms – 1s). Too fast = network noise. Too slow = late detection.
  • Timeout (2–5× interval). Too tight = false positives on jitter. Too loose = real failures wait.

Timeout-based detection is simple but binary. A node is either "alive" or "dead" — no nuance for "seems slow today." Real systems want something better.

03

Phi accrual failure detector

Pioneered by Hayashibara et al. (2004), used in Cassandra and Akka. Instead of a binary dead/alive flag, maintain a running suspicion level φ that grows continuously as time passes without a heartbeat.

  • Track the history of inter-arrival times for a peer's heartbeats (rolling window).
  • Fit a probability distribution (usually normal) over that history.
  • At time t since last heartbeat, φ = −log10(P[next heartbeat arrives ≥ t]). As t grows, probability shrinks, φ rises.

Application subscribes at a threshold: "treat node as suspect when φ > 8" (about 1 in 10⁸ chance of being a false alarm). Adaptive to network conditions — if the peer's heartbeats are normally jittery, the distribution widens, the detector tolerates it. On a normally-stable peer, small delays already trigger suspicion.

04

SWIM — the push-less alternative

SWIM (Scalable Weakly-consistent Infection-style Membership) flips the script: instead of everyone pinging everyone, each tick a node pings one random peer. If the ping fails, it asks k other random peers to try indirectly. Only if all indirect probes also fail does it mark the node suspect.

Why it works: a direct ping can fail due to local network jitter between just two nodes. If three other nodes also can't reach the target, it's probably really down. Dramatic reduction in false positives vs timeout-based.

SWIM is what Hashicorp Serf, Consul, and Memberlist run. Scales to 10k+ nodes because ping traffic is O(1) per node per tick, not O(N).

See gossip protocols for how failure-detection state spreads through the cluster.

05

Deep dive — the tuning rule nobody tells you

All failure detectors face the same fundamental tradeoff (formalized by Chandra-Toueg, 1996):

  • Completeness — every dead node is eventually detected.
  • Accuracy — no live node is incorrectly suspected.

You can't have both perfectly. The tradeoff shifts with the environment:

  • Datacenter (sub-ms latency, minimal jitter) — tight timeouts work. 3× heartbeat interval ≈ 1.5s is safe.
  • Cross-region (100ms+ latency, occasional jitter) — use longer timeouts (10–30s) and adaptive detection.
  • Mobile / edge (unpredictable) — accept higher false-positive rate; design the app to tolerate reconnects rather than fighting for detection accuracy.

Critical production rule: when a node is marked dead, do not take destructive action immediately. Evict from rotation (easy to undo), re-route new requests, but don't drop existing connections or delete data for several minutes. Many "node down" events recover on their own.

Anti-pattern

Auto-failover that triggers in 5 seconds based on a single missed ping. One bad switch causes a stampede of failovers across the cluster. Use consensus-backed election (Raft) or require a quorum of nodes to agree on "X is down" before acting.

The Tuning Tradeoff — False Positives vs Detection Latency SVG
false positive rate timeout duration → (longer = slower detection) 1s timeout FP ~20% · detect 1s 5s timeout FP ~5% · detect 5s 30s timeout FP ~0.5% · detect 30s tighter
100 ms
heartbeat interval (in-DC)
3-5×
timeout in interval units
k=3
SWIM indirect-probe peers
φ=8
Cassandra phi-accrual default threshold
06

Real-world

Cassandra

Phi accrual

Ring-wide failure detection via gossip + adaptive φ. Tuneable threshold (default 8).

Consul / Serf

SWIM

Push-less; probes random peers. Default 1s tick. Indirect probes through k=3 peers.

Kubernetes

Readiness + liveness probes

Simple timeout-based. kubelet probes each pod; 3 failures → unready → pulled from service endpoints.

Zookeeper / etcd

Session-based

Clients hold sessions with heartbeats. Session expiry = client dead → ephemeral znodes cleaned up.

07

Used in problems

Distributed locking relies on heartbeat-driven lease renewal. Distributed job scheduler detects worker failures via heartbeats. Stock exchange uses low-timeout detection to re-elect matching engines within 200ms. Count active users literally counts heartbeat signals.

Next up