Is node 7 dead, or is its network link just slow? You can't tell from a missed response. Decide too quickly → false positive: you evict a healthy node. Decide too slowly → real failures go unnoticed and requests time out for minutes. Every distributed system has to answer this question constantly, for hundreds of nodes. The answer is a failure-detection algorithm, driven by heartbeats.
Get it wrong and you thrash: healthy nodes get marked dead, rejoin, get marked dead again, rejoin again. Every replicated database, load balancer, and orchestrator runs some variant of this machinery.
02
The simple model — timeout-based heartbeats
Each node periodically sends a heartbeat ("I'm alive") to its peers. If a peer hasn't heard from a node in T seconds → mark it suspect. T more seconds without a heartbeat → mark it dead.
Two parameters rule everything:
Heartbeat interval (100ms – 1s). Too short = heartbeat traffic adds load and noise to the network. Too long = late detection.
Timeout (2–5× interval). Too tight = false positives on jitter. Too loose = real failures go undetected for longer.
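The two-stage rule above (suspect after T seconds of silence, dead after another T) can be sketched as follows. This is a minimal illustration, not any particular system's implementation; the names `TimeoutDetector`, `Status`, and `timeout_s` are made up for the example, and timestamps are passed in explicitly so the logic is easy to test.

```python
from enum import Enum


class Status(Enum):
    ALIVE = "alive"
    SUSPECT = "suspect"
    DEAD = "dead"


class TimeoutDetector:
    """Hypothetical two-stage detector: suspect after T, dead after 2T of silence."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        # Record the arrival time of the latest heartbeat from this node.
        self.last_seen[node] = now

    def status(self, node, now):
        silence = now - self.last_seen[node]
        if silence >= 2 * self.timeout_s:
            return Status.DEAD
        if silence >= self.timeout_s:
            return Status.SUSPECT
        return Status.ALIVE


detector = TimeoutDetector(timeout_s=2.0)
detector.heartbeat("node-7", now=0.0)
for t in (1.0, 2.5, 4.0):
    print(t, detector.status("node-7", now=t))
```

With a 2-second timeout, the same node moves from alive to suspect to dead purely as a function of elapsed silence, which is exactly the binary behaviour the timeout model gives you.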
Timeout-based detection is simple but binary. A node is either "alive" or "dead" — no nuance for "seems slow today." Real systems want something better.
03
Phi accrual failure detector
Pioneered by Hayashibara et al. (2004), used in Cassandra and Akka. Instead of a binary dead/alive flag, maintain a running suspicion level φ that grows continuously as time passes without a heartbeat.
Track the history of inter-arrival times for a peer's heartbeats (rolling window).
Fit a probability distribution (usually normal) over that history.
At time t since the last heartbeat, φ(t) = −log10(P[next heartbeat arrives later than t]). As t grows, the probability shrinks and φ rises.
The application subscribes at a threshold: "treat node as suspect when φ > 8" (about a 1 in 10⁸ chance of a false alarm, if the fitted distribution holds). The detector adapts to network conditions: if the peer's heartbeats are normally jittery, the distribution widens and the detector tolerates it. On a normally stable peer, even small delays raise suspicion.
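The steps above can be sketched in a few lines, assuming a normal distribution over inter-arrival times. This is an illustrative sketch, not the Cassandra or Akka implementation; `PhiAccrualDetector`, `window_size`, and `min_std_s` are hypothetical names, and the floor on the standard deviation and on the probability are assumptions added to keep the arithmetic well defined.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Hypothetical phi accrual sketch using a normal model of heartbeat gaps."""

    def __init__(self, window_size=100, min_std_s=0.01):
        self.intervals = deque(maxlen=window_size)  # rolling window of gaps
        self.last_arrival = None
        self.min_std_s = min_std_s  # assumed floor so sigma is never zero

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0  # not enough history to fit a distribution
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std_s)
        elapsed = now - self.last_arrival
        # P[next heartbeat arrives later than `elapsed`] under Normal(mu, sigma).
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) on extreme delays
        return -math.log10(p_later)


detector = PhiAccrualDetector()
for arrival in (0.0, 0.9, 2.0, 3.0, 4.2, 5.0):  # roughly one heartbeat per second
    detector.heartbeat(arrival)
for now in (5.5, 6.0, 7.0):
    print(now, round(detector.phi(now), 2))
```

With heartbeats arriving about once a second, φ stays near zero shortly after a heartbeat and climbs steeply once the silence stretches well past the usual gap; a caller would compare the returned value against its chosen threshold.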
04
SWIM — the push-less alternative
SWIM (Scalable Weakly-consistent Infection-style Process Group Membership Protocol) flips the script: instead of everyone pinging everyone, each tick a node pings one random peer. If the ping fails, it asks k other random peers to try indirectly. Only if all indirect probes also fail does it mark the node suspect.
Why it works: a direct ping can fail due to local network jitter between just two nodes. If k other nodes (say, three) also can't reach the target, it's probably really down. The result is a dramatic reduction in false positives vs timeout-based detection.
SWIM is the basis of HashiCorp's memberlist library, which Serf and Consul build on. It scales to 10k+ nodes because ping traffic is O(1) per node per tick, not O(N).
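One probe tick of the direct-then-indirect scheme can be sketched as below. This is a simplified illustration, not memberlist's code: `probe_round` and the `can_reach(src, dst)` callback are hypothetical, and the callback stands in for a real ping with its own timeout.

```python
import random


def probe_round(self_id, members, can_reach, k=3, rng=random):
    """Run one SWIM-style probe tick from self_id; return (target, verdict)."""
    others = [m for m in members if m != self_id]
    target = rng.choice(others)

    # Step 1: direct ping to one random peer.
    if can_reach(self_id, target):
        return target, "alive"

    # Step 2: ask up to k other random peers to ping the target on our behalf.
    pool = [m for m in others if m != target]
    helpers = rng.sample(pool, min(k, len(pool)))
    for helper in helpers:
        if can_reach(self_id, helper) and can_reach(helper, target):
            return target, "alive"

    # Step 3: the direct probe and every indirect probe failed.
    return target, "suspect"


members = ["a", "b", "c", "d", "e"]


def flaky_link(src, dst):
    # Assumed scenario: only the a<->b link is broken; b itself is healthy.
    return {src, dst} != {"a", "b"}


print(probe_round("a", members, flaky_link, rng=random.Random(1)))
```

In the flaky-link scenario the verdict is "alive" whichever target is picked, because some helper can still reach b. The node does a constant amount of work per tick regardless of cluster size.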
See gossip protocols for how failure-detection state spreads through the cluster.
05
Deep dive — the tuning rule nobody tells you
All failure detectors face the same fundamental tradeoff between two properties (formalized by Chandra and Toueg, 1996):
Completeness — every dead node is eventually detected.
Accuracy — no live node is incorrectly suspected.
You can't have both perfectly. The tradeoff shifts with the environment:
Cross-region (100ms+ latency, occasional jitter) — use longer timeouts (10–30s) and adaptive detection.
Mobile / edge (unpredictable) — accept higher false-positive rate; design the app to tolerate reconnects rather than fighting for detection accuracy.
Critical production rule: when a node is marked dead, do not take destructive action immediately. Evict from rotation (easy to undo), re-route new requests, but don't drop existing connections or delete data for several minutes. Many "node down" events recover on their own.
Anti-pattern
Auto-failover that triggers in 5 seconds based on a single missed ping. One bad switch causes a stampede of failovers across the cluster. Use consensus-backed election (Raft) or require a quorum of nodes to agree on "X is down" before acting.
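The "require a quorum" option can be sketched as a simple majority check over independent observers. This is only an illustration of the idea, not Raft or any specific system's election logic; `quorum_agrees_down` and the shape of `reports` are assumptions made for the example.

```python
def quorum_agrees_down(reports, cluster_size):
    """Return True only if a strict majority of the cluster reports the node as down.

    `reports` maps observer id -> True if that observer cannot reach the target.
    Observers that have not reported are counted as "not down".
    """
    down_votes = sum(1 for is_down in reports.values() if is_down)
    return down_votes > cluster_size // 2


# A single missed ping from one observer is not enough to act on.
print(quorum_agrees_down({"n1": True}, cluster_size=5))
# Three of five observers agreeing is a majority.
print(quorum_agrees_down({"n1": True, "n2": True, "n3": True, "n4": False}, cluster_size=5))
```

Counting silent observers as "not down" is a deliberately conservative choice here: a partitioned minority cannot trigger a failover on its own.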
Figure: The Tuning Tradeoff (False Positives vs Detection Latency)