System Design — 017

Distributed Locking

How do you guarantee that at most one process operates on a shared resource at any time, across machines that can crash, networks that can partition, and clocks that can drift?

consensusfencing-tokensraftmutual-exclusionlinearizability
01

Problem Statement

On a single machine, locking is solved — your language gives you mutexes backed by shared memory and atomic CPU instructions. In a distributed system, there's no shared memory. Process A on Machine 1 and Process B on Machine 2 can only coordinate via the network, which is unreliable. The lock state must live somewhere both can access — a separate coordination service.

The central insight: A distributed lock, by itself, does NOT guarantee mutual exclusion. If a lock holder pauses (GC) and the lock expires (TTL), another holder acquires it. Two holders operate simultaneously. True mutual exclusion requires a lock service plus fencing tokens at the protected resource.

Three Categories of Locking Need

Tier 1 — Efficiency

Prevent duplicate work (cache stampede, job dedup). Violation = wasted work. Tolerable. Use Redis SET NX.

Tier 2 — Correctness

Protect shared state (wallet balance, leader election). Violation = data corruption, split brain. Catastrophic. Use consensus + fencing.

The most common production mistake: using a Tier 1 lock for a Tier 2 problem. Fast, simple, almost always correct — until the one time it isn't, and $50,000 vanishes from a user's account.

The Naive Lock & Its Four Failures

A database row lock (INSERT INTO locks ON CONFLICT DO NOTHING) fails in four ways: holder crashes → lock stuck forever (no TTL), TTL expires while holder works (two holders simultaneously), network partition (lock unreachable), and clock skew (TTL unreliable). Every component in our distributed design solves one of these.

02

Requirements

Functional Requirements

  • Acquire lock — blocking (wait for availability) and non-blocking (try, fail fast). Every acquisition returns a fencing token — a monotonically increasing counter. Non-negotiable.
  • Release lock — owner-verified (must present matching token). Prevents stale clients from releasing active locks.
  • Renew / extend — extend TTL without releasing. Background thread renews at 1/3 of TTL intervals. Renewal failure = lock lost signal.
  • Session management — clients establish sessions via persistent gRPC streams. All locks tied to session. Session death → all locks release.
  • Multi-lock — acquire multiple resources atomically with canonical ordering (deadlock prevention). All-or-nothing semantics.
  • Watch — observe lock events for monitoring dashboards. Not for control flow (TOCTOU race).

Non-Functional Requirements

  • Mutual exclusion (Safety) — at most one holder at any time. For Tier 2, this must hold under all failure modes — requiring consensus + fencing.
  • Liveness (Deadlock freedom) — every request eventually succeeds or fails. Guaranteed by TTL expiry + wait timeouts.
  • Low latency — uncontended acquire < 5ms (Tier 2) or < 1ms (Tier 1).
  • High availability — 99.99%. Survives single node failure (5-node cluster tolerates 2 failures).
  • Linearizability — every lock operation appears to happen at a single point in time. No stale reads from lagging replicas.
  • Fairness — FIFO wait queues. No starvation of slow clients.

CP, not AP: An AP lock service is a contradiction — a lock that doesn't guarantee mutual exclusion isn't a lock. During a network partition, the minority side must refuse to serve.

03

Scale Estimation

500 microservices, each making ~100 lock operations/sec. The bottleneck is not storage (186 MB fits in memory) — it's write throughput through consensus.

50K/s
Total lock operations/sec
500K
Active locks at any moment
186 MB
Total state (fits in RAM)
5-7ms
Consensus lock latency
~1ms
Redis Tier 1 latency
1K/s
Session keep-alives (batched)

Key Insight: Consensus-Bound, Not Storage-Bound

A single etcd cluster: 10-20K writes/sec. A single ZK cluster: 15-30K writes/sec. Our 41K writes/sec (after session batching) is at the ceiling of a single consensus group. This forces either aggressive batching/pipelining or namespace sharding at higher scale. Session-based renewal (1K/sec) vs per-lock renewal (35K/sec) is a 35× reduction — a mandatory optimization.

04

API Design

gRPC-based (not REST) — locks are latency-sensitive, connection-oriented (sessions), and benefit from streaming (watch, keep-alive). The client library wraps all complexity.

Acquire Lock
Lock(request):
  session_id: string
  resource_id: string          // "wallet:user_123"
  ttl: Duration                // 30 seconds
  wait_timeout: Duration       // 0 for tryLock
  tier: EFFICIENCY | CORRECTNESS

→ LockResponse:
  acquired: bool
  lock_handle: {
    resource_id, fence_token: uint64, expires_at
  }
Release Lock (Owner-Verified)
Unlock(request):
  session_id: string
  resource_id: string
  fence_token: uint64          // MUST match current token

→ UnlockResponse:
  released: bool
  reason: "ok" | "not_owner" | "already_released" | "expired"
Client Library — Safe by Default
client = LockServiceClient("lock-service:2379")

with client.lock("wallet:user_123", ttl=30, tier=CORRECTNESS) as lock:
    balance = db.read("wallet:user_123")
    db.write("wallet:user_123", balance - 100,
             fence_token=lock.fence_token)
# Lock auto-released when context exits

Fencing Token Contract

Monotonically increasing per resource. Unique per acquisition. Survives leader failover (part of consensus state). The protected resource must reject writes with token < its high-water mark.

Session Model

Persistent gRPC stream with keep-alive every 10s. Session TTL: 30s (3 missed keep-alives → expired). All locks tied to session — one death releases all.

05

High-Level Architecture

A 5-node Raft-replicated state machine storing lock ownership, sessions, fence tokens, and wait queues in memory (~186 MB), with a smart client library handling session management and tier-based routing.

Client Library Thick SDK Redis Tier 1 (~1ms) Raft Leader Tier 2 (~5-7ms) Follower 1 Replica Follower 2 Replica Follower 3 Replica Follower 4 Replica Lock State Machine locks, sessions, fence_counters wait_queues (186 MB in RAM) Event Streamer → Kafka → Dashboards Protected Resource DB with fence check Tier 1 Tier 2 Replicate Apply Events fence_token
ComponentTechWhy
Client LibraryThick SDKSession mgmt, renewal, circuit breaker, tier routing, fencing tracking
Raft Cluster5-node consensusLinearizable lock state, survives 2 failures, fencing tokens never go backwards
Lock State MachineIn-memory (186 MB)Locks, sessions, fence counters, FIFO wait queues — all consensus-replicated
Redis SidecarRedis instanceTier 1 fast path (~1ms) for efficiency locks
Session ManagerIn-processKeep-alive tracking, dead-client detection, batch lock cleanup
Event StreamerRing buffer → KafkaObservability without adding latency to critical path
Request Flow — Step Through
Client LibSessionRaft LeaderFollowersState MachineClient Gets TokenResource WriteFence Check
Click Next Step to walk through the request flow.
06

Deep Dive — Consensus & Fencing

The spectrum from "fast and best-effort" to "slow and provably correct":

LevelApproachLinearizableSurvives Slow ClientTrue Mutual Exclusion
1Single Redis SET NX EXNoNoNo
2Redlock (5 Redis nodes)NoNoNo
3ZooKeeper / etcd (consensus)YesNoNo*
4Consensus + Fencing TokensYesYesYes

*Consensus without fencing can be violated by a slow holder that outlives its lease and sends a stale write.

The Redlock Debate (Kleppmann vs Antirez)

Kleppmann's Critique

Redlock assumes bounded timing (GC pauses break validity calculations) and accurate clocks (NTP jumps cause premature expiry). If you add fencing tokens, you don't need Redlock — single Redis + fencing is equally correct.

Our Verdict

Kleppmann is correct on theory. Redlock occupies an awkward middle ground — more infrastructure than single Redis, but not actually linearizable. For correctness: use consensus + fencing. For efficiency: single Redis is simpler.

Fencing Token Flow

sequenceDiagram participant C_A as Client A participant LS as Lock Service (Raft) participant C_B as Client B participant DB as Database C_A->>LS: LOCK("wallet:123", ttl=30s) LS-->>C_A: acquired, token=42 C_A->>DB: WRITE balance=$400, fence_token=42 Note over C_A: GC pause 35 seconds... Note over LS: Session expires at T=30 LS->>LS: Release lock (session expired) C_B->>LS: LOCK("wallet:123", ttl=30s) LS-->>C_B: acquired, token=43 C_B->>DB: WRITE balance=$300, fence_token=43 DB->>DB: 43 > hwm(0) → accept, hwm=43 Note over C_A: Wakes from GC at T=35 C_A->>DB: WRITE balance=$400, fence_token=42 (delayed) DB->>DB: 42 < hwm(43) → REJECTED DB-->>C_A: StaleFenceTokenError

The proof: The database's high-water mark ensures that even if Client A's write was sent while the lock was validly held (at T=0.2), it's rejected when it arrives late (at T=35) because Client B's write (token=43) has already raised the bar. The fencing token at the resource is the actual safety mechanism — not the lock.

Why Consensus Locks Still Need Fencing

Even Raft-backed locks can be violated. The consensus protocol guarantees linearizable lock state (who the lock service thinks holds the lock). But it can't control what happens outside the lock service — a write can be in the TCP buffer when the lock expires. The write arrives at the database after a new holder's write. Without fencing, data corrupts. Consensus + fencing = true correctness. Neither alone is sufficient.

07

Key Design Decisions & Tradeoffs

CP vs AP Lock Service

✓ Chosen

CP — Consistency over Availability

Minority partition refuses to serve. Mutual exclusion always maintained. Clients get UNAVAILABLE during partition — safe.

✗ Alternative

AP — Availability over Consistency

Both sides serve during partition. Two holders simultaneously. An AP lock is a contradiction — never correct for Tier 2.

Session-Based vs TTL-Only Expiry

✓ Chosen

Session + Heartbeat

All locks tied to session. One keep-alive covers all locks (1K/sec vs 35K/sec). Dead client detection via missed heartbeats — faster than individual TTL expiry.

✗ Alternative

Per-Lock TTL (Redis-style)

Each lock independently expires. 35K renewals/sec overhead. Simpler (no sessions) but slower dead-client detection and 35× more renewal traffic.

Consensus Protocol

✓ Chosen

Raft

Designed for clarity. Excellent library ecosystem (etcd/raft, hashicorp/raft). Single leader handles our throughput. Well-defined membership changes.

✗ Alternative

Multi-Paxos

Higher theoretical throughput (multi-proposer). Used by Google Chubby. But notoriously difficult to implement correctly. Only justified at Google scale.

Fencing Enforcement Location

✓ Chosen

Resource-Side Enforcement

Lock service issues tokens. Resource checks them. Loose coupling — lock service doesn't need to understand every resource type. Scales naturally.

✗ Alternative

Lock-Service-Side Enforcement

Lock service proxies all writes, attaching tokens. Guaranteed enforcement but tight coupling — lock service becomes bottleneck for all writes.

FIFO vs Competitive Lock Acquisition

✓ Chosen

FIFO Wait Queues

Waiters served in order. No starvation. Single-successor notification (no thundering herd). ZooKeeper's sequential nodes do this naturally.

✗ Alternative

Competitive (Race on Release)

All waiters race. Fastest wins. Simpler but starvation risk and thundering herd spike on every release. Bad for multi-tenant fairness.

08

What Can Go Wrong

Lock Holder Dies

gRPC stream breaks → detected in ~100-500ms (TCP close) or up to session TTL (~30s, missed keep-alives). Session expires → all locks release atomically in one consensus entry → next waiters notified. Fencing tokens protect against any writes the dead client sent before crashing.

Slow Client (GC Pause Outlives Lock) — THE Critical Failure

Client A acquires lock (token=42), pauses for 35 seconds. Lock expires at 30s. Client B acquires (token=43). Both operate simultaneously. No lock service can prevent this — the write may be in the TCP buffer when the pause starts. Fencing token at the database is the only defense: B's write sets high-water mark to 43, A's delayed write (token=42 < 43) is rejected.

Network Partition

Client on minority side can't reach lock service. Session eventually expires. Client continues operating, unaware lock is lost. Defense: client library tracks last successful keep-alive — warns application when nearing session TTL. Plus fencing tokens reject stale writes from the partitioned client.

Leader Failover

2-3 second disruption during election. No lock state lost — all committed entries exist on majority. Clients get UNAVAILABLE, retry automatically. Existing locks remain valid (TTLs tracked on all nodes). Fence token counters never go backwards (consensus-replicated).

Clock Skew

Lock TTLs use logical ticks (Raft heartbeat intervals), not wall clocks. All nodes agree on the tick count via consensus. No clock drift affects expiry decisions. Wall clocks used only for logging and external-facing timestamps.

Thundering Herd on Hot Lock Release

200 waiters on a single lock. With broadcast notification: 200 acquire attempts, 199 wasted. With FIFO single-successor notification: only the next waiter is notified. Zero wasted requests. ZooKeeper's watch chain does this naturally.

Deadlock (Circular Wait)

Client A holds lock 1, wants lock 2. Client B holds lock 2, wants lock 1. Circular wait → both stuck. Defense: wait timeouts (breaks deadlock after N seconds), canonical ordering in MultiLock (sort resource IDs, acquire in order — no cycles possible), optional deadlock detection via wait-for graph.

09

Interview Tips

💡
Lead with the central insight.
"A distributed lock by itself does NOT guarantee mutual exclusion. A slow holder can outlive the TTL. True safety requires a lock service plus fencing tokens at the protected resource." This one statement separates you from 95% of candidates.
Walk through the 5-level progression.
In-process mutex → database row lock → Redis SET NX → Redlock → Consensus + fencing. Each level breaks and motivates the next. Takes 90 seconds but tells a complete story. The design feels derived, not memorized.
🎯
Know the Redlock debate cold.
Kleppmann's critique (bounded timing assumption, clock dependency) vs Antirez's response (add fencing on top). Your verdict: "If you have fencing, you don't need Redlock. If you don't have fencing, Redlock doesn't save you."
🔑
Draw the fencing token flow.
Client A (token=42) pauses. Lock expires. Client B (token=43) writes. Client A's delayed write arrives. Database: 42 < 43 → REJECTED. This concrete visual makes abstract concepts tangible.
🧠
Distinguish Tier 1 from Tier 2.
"The consequence of lock violation determines the design. Cache stampede? Redis is fine. Financial transaction? Consensus + fencing. Most production bugs come from using a Tier 1 lock for a Tier 2 problem."
🎯
Never claim your lock prevents all concurrency issues.
"Locks prevent concurrent access. Database transactions prevent partial updates. You often need both. If the holder crashes mid-transaction, the lock releases but partial writes persist — the transaction rolls back, not the lock."
10

Similar Problems

Distributed Job Scheduler

Uses locks for exactly-once execution (Redis leases) and leader election (ZooKeeper). Lock service is the foundation.

Leader Election

Special case of locking — single long-lived lock. Fencing token becomes the epoch number for all downstream operations.

Distributed Transactions (2PC)

Locks provide mutual exclusion; transactions provide atomicity. Often used together. MultiLock parallels 2PC's prepare phase.

Optimistic Concurrency Control

Fencing tokens ARE OCC — version check at write time. Our design is a hybrid: pessimistic (lock) + optimistic (fence check).

Rate Limiter

Same distributed Redis state, time-based expiry, clock skew challenges. Different correctness requirement (AP vs CP).

Distributed Semaphore

Generalization from max_holders=1 to max_holders=N. Same sessions, TTLs, and wait queues. Fencing tokens less useful.

11

Evolution

Each stage is triggered by a specific failure or scale ceiling. The one exception: fencing tokens should be designed in from the start — cheap to implement, impossible to retrofit safely.

1

In-Process Mutex — 1 server

Language-level lock. Zero infrastructure. Breaks the moment you add a second server — no shared memory across processes.

2

Database Row Lock — 2-10 servers

INSERT with unique constraint or pg_try_advisory_lock. Breaks when holder crashes → lock stuck forever (no TTL).

3

Redis SET NX EX — 2-50 servers, <10K locks/sec

TTL solves stuck locks. ~1ms latency. Breaks on failover (async replication loses locks) and slow holders (TTL outlived). Fine for Tier 1.

4

Consensus (etcd/ZooKeeper) — 10-200 servers

Linearizable lock state. Sessions with keep-alive. FIFO wait queues. Fencing tokens via revision numbers. ~5-15ms. Breaks at 30K+ ops/sec (consensus ceiling).

5

Two-Tier Custom Service — 50-500 services (Target Design)

Tier 1 (Redis ~1ms) + Tier 2 (Raft ~5-7ms). Thick client library. Purpose-built fencing tokens, FIFO queues, session management. The design we built in this document.

6

Sharded Multi-Cluster — 500+ services

Shard lock namespace across N Raft groups (hash(resource) % N). Linear throughput scaling. Per-shard blast radius. Cross-shard MultiLock not supported.

7

Multi-Region Global — 1000+ services, global

Region-local clusters. Global lock coordinator for cross-region resources. Spanner-style TrueTime for global fencing token ordering (aspirational). Data sovereignty compliance.

Next up