Concept · Distributed Systems

Read Repair & Anti-Entropy

01

Why this matters

You replicate data across 3 nodes for durability. A network blip causes node 2 to miss a write. Now node 1 + 3 have v2; node 2 has v1. Replicas have diverged. Without active repair, this divergence accumulates — the longer the system runs, the more inconsistencies pile up. Eventually you read from node 2 and get stale data forever.

Dynamo-style systems (Cassandra, DynamoDB, Riak) use two complementary mechanisms: read repair (fix on the fly during reads) + anti-entropy (background sweep using Merkle trees). Together they keep replicas converging even under sustained partial failures.

02

Read repair — opportunistic fix

On a quorum read (R=3 in N=3 system), the coordinator gets responses from multiple replicas. If they disagree:

  1. Take the most recent value (per timestamp / vector clock).
  2. Return it to the client.
  3. In the background, push the latest value to whichever replicas had stale versions.

Cost: zero on the read path (the fix happens after the response). Benefit: every read on a divergent key fixes it. Hot keys converge in seconds.

Variants:

  • Foreground read repair — fix synchronously before returning. Slower; ensures the next read sees consistent data.
  • Background read repair — fix async, return immediately. Faster; next read might still see stale.
  • Probabilistic read repair — sample 10% of reads, repair on those. Cheap; eventually converges.
Read-repair in quorum read
def read_with_repair(key, replicas, R):
    # Collect R responses
    responses = []
    for r in replicas:
        try: responses.append((r, r.get(key)))
        except: pass
        if len(responses) >= R: break

    if not responses: return None

    # Pick version with latest vector clock
    latest = max(responses, key=lambda x: x[1].version if x[1] else 0)

    # Read-repair: fire-and-forget writes to replicas with stale data
    for r, resp in responses:
        if resp is None or resp.version < latest[1].version:
            try: r.put(key, latest[1])  # non-blocking; doesn't affect caller
            except: pass

    return latest[1]

# Lightweight anti-entropy: stale replicas caught up on every read they serve.
03

Anti-entropy — the background sweep

Read repair only fixes keys that are read. Cold data — keys nobody reads — diverges silently. Anti-entropy sweeps the entire keyspace periodically.

Naive: replica A streams every key + value to replica B; B compares + repairs. Bandwidth = entire dataset. Useless at TB scale.

Practical: build a Merkle tree over each replica's key range. Replicas exchange roots first. Match → done, zero data transferred. Differ → recurse to identify exactly which sub-ranges diverge, transfer only those.

For 1 TB of identical data: one hash exchange. For 1 TB with 1000 differing keys: ~30 hash comparisons + 1000 row transfers. Bandwidth scales with divergence, not data size.

04

Combining the two

Read repair only

Fast convergence on hot data

Hot keys (top 1% by traffic) re-converge in seconds. Cold data drifts forever. Bad for archive workloads where most data isn't read often.

Read repair + anti-entropy

Both hot and cold converge

Read repair handles 99% of cases instantly. Weekly anti-entropy sweep catches the cold-key drift. Bandwidth amortized across days. The Cassandra default.

05

Deep dive — hinted handoff, the third leg

There's a third related mechanism that completes the triangle. When node A is temporarily unreachable during a write, the coordinator stores the write locally as a hint ("write this to A when it's back"). When A recovers, hints are streamed to it.

Three-mechanism playbook in Dynamo-style systems:

  1. Hinted handoff — handle short partitions. A is briefly unreachable; coordinator buffers writes; replays when A returns. Window: minutes to hours.
  2. Read repair — handle missed hints / longer outages. A came back but missed some hints. Reads expose the divergence; repair fixes on the fly. Window: seconds per hot key.
  3. Anti-entropy — final safety net. Cold keys never get read-repaired. Periodic Merkle-tree comparison closes the gap. Window: per scheduled run (weekly typical).

Together: the system always converges to consistency, even after extended failures, without requiring synchronous replication on the hot path.

Interview answer

"Eventual consistency requires active convergence. Three layers: hinted handoff for short partitions, read repair for hot-key drift, anti-entropy with Merkle trees for cold-key drift. Cassandra and DynamoDB run all three; together they guarantee replicas converge even when individual mechanisms miss."

06

Real-world

Cassandra

All three layers

Hinted handoff (3 hours default), read repair (10% of reads probabilistic), anti-entropy nodetool repair weekly.

DynamoDB internal

Same mechanisms

Original Dynamo paper (2007) introduced the pattern. AWS-internal DynamoDB uses all three; they're invisible to users.

Riak

Tunable per bucket

Read repair always on. Anti-entropy can be tuned per workload — heavy-write buckets repair more aggressively.

CockroachDB / TiDB

Don't need it

Strong-consistency systems use Raft replication — divergence can't happen. Read repair is unnecessary; anti-entropy unused.

07

Used in problems

News feed's Cassandra-backed feed store relies on read-repair for the hot users + anti-entropy for cold archive. WhatsApp's chat replicas use hinted handoff during regional partition recovery. Distributed logging uses Merkle-based anti-entropy for log-ingest reconciliation.

Next up