Distributed Locking

01

Problem Statement

On a single machine, locking is solved — your language gives you mutexes backed by shared memory and atomic CPU instructions. In a distributed system, there's no shared memory. Process A on Machine 1 and Process B on Machine 2 can only coordinate via the network, which is unreliable. The lock state must live somewhere both can access — a separate coordination service.

The central insight: A distributed lock, by itself, does NOT guarantee mutual exclusion. If a lock holder pauses (GC) and the lock expires (TTL), another holder acquires it. Two holders operate simultaneously. True mutual exclusion requires a lock service plus fencing tokens at the protected resource.

Three Categories of Locking Need

Tier 1 — Efficiency

Prevent duplicate work (cache stampede, job dedup). Violation = wasted work. Tolerable. Use Redis SET NX.

Tier 2 — Correctness

Protect shared state (wallet balance, leader election). Violation = data corruption, split brain. Catastrophic. Use consensus + fencing.

The most common production mistake: using a Tier 1 lock for a Tier 2 problem. Fast, simple, almost always correct — until the one time it isn't, and $50,000 vanishes from a user's account.

The Naive Lock & Its Four Failures

A database row lock (INSERT INTO locks ON CONFLICT DO NOTHING) fails in four ways: holder crashes → lock stuck forever (no TTL), TTL expires while holder works (two holders simultaneously), network partition (lock unreachable), and clock skew (TTL unreliable). Every component in our distributed design solves one of these.

02

Requirements

Functional Requirements

Acquire lock — blocking (wait for availability) and non-blocking (try, fail fast). Every acquisition returns a fencing token — a monotonically increasing counter. Non-negotiable.
Release lock — owner-verified (must present matching token). Prevents stale clients from releasing active locks.
Renew / extend — extend TTL without releasing. Background thread renews at 1/3 of TTL intervals. Renewal failure = lock lost signal.
Session management — clients establish sessions via persistent gRPC streams. All locks tied to session. Session death → all locks release.
Multi-lock — acquire multiple resources atomically with canonical ordering (deadlock prevention). All-or-nothing semantics.
Watch — observe lock events for monitoring dashboards. Not for control flow (TOCTOU race).

Non-Functional Requirements

Mutual exclusion (Safety) — at most one holder at any time. For Tier 2, this must hold under all failure modes — requiring consensus + fencing.
Liveness (Deadlock freedom) — every request eventually succeeds or fails. Guaranteed by TTL expiry + wait timeouts.
Low latency — uncontended acquire < 5ms (Tier 2) or < 1ms (Tier 1).
High availability — 99.99%. Survives single node failure (5-node cluster tolerates 2 failures).
Linearizability — every lock operation appears to happen at a single point in time. No stale reads from lagging replicas.
Fairness — FIFO wait queues. No starvation of slow clients.

CP, not AP: An AP lock service is a contradiction — a lock that doesn't guarantee mutual exclusion isn't a lock. During a network partition, the minority side must refuse to serve.

03

Scale Estimation

500 microservices, each making ~100 lock operations/sec. The bottleneck is not storage (186 MB fits in memory) — it's write throughput through consensus.

50K/s

Total lock operations/sec

500K

Active locks at any moment

186 MB

Total state (fits in RAM)

5-7ms

Consensus lock latency

~1ms

Redis Tier 1 latency

1K/s

Session keep-alives (batched)

Key Insight: Consensus-Bound, Not Storage-Bound

A single etcd cluster: 10-20K writes/sec. A single ZK cluster: 15-30K writes/sec. Our 41K writes/sec (after session batching) is at the ceiling of a single consensus group. This forces either aggressive batching/pipelining or namespace sharding at higher scale. Session-based renewal (1K/sec) vs per-lock renewal (35K/sec) is a 35× reduction — a mandatory optimization.

04

API Design

gRPC-based (not REST) — locks are latency-sensitive, connection-oriented (sessions), and benefit from streaming (watch, keep-alive). The client library wraps all complexity.

Acquire Lock

Lock(request):
  session_id: string
  resource_id: string          // "wallet:user_123"
  ttl: Duration                // 30 seconds
  wait_timeout: Duration       // 0 for tryLock
  tier: EFFICIENCY | CORRECTNESS

→ LockResponse:
  acquired: bool
  lock_handle: {
    resource_id, fence_token: uint64, expires_at
  }

Release Lock (Owner-Verified)

Unlock(request):
  session_id: string
  resource_id: string
  fence_token: uint64          // MUST match current token

→ UnlockResponse:
  released: bool
  reason: "ok" | "not_owner" | "already_released" | "expired"

Client Library — Safe by Default

client = LockServiceClient("lock-service:2379")

with client.lock("wallet:user_123", ttl=30, tier=CORRECTNESS) as lock:
    balance = db.read("wallet:user_123")
    db.write("wallet:user_123", balance - 100,
             fence_token=lock.fence_token)
# Lock auto-released when context exits

Fencing Token Contract

Monotonically increasing per resource. Unique per acquisition. Survives leader failover (part of consensus state). The protected resource must reject writes with token < its high-water mark.

Session Model

Persistent gRPC stream with keep-alive every 10s. Session TTL: 30s (3 missed keep-alives → expired). All locks tied to session — one death releases all.

05

High-Level Architecture

A 5-node Raft-replicated state machine storing lock ownership, sessions, fence tokens, and wait queues in memory (~186 MB), with a smart client library handling session management and tier-based routing.

Component	Tech	Why
Client Library	Thick SDK	Session mgmt, renewal, circuit breaker, tier routing, fencing tracking
Raft Cluster	5-node consensus	Linearizable lock state, survives 2 failures, fencing tokens never go backwards
Lock State Machine	In-memory (186 MB)	Locks, sessions, fence counters, FIFO wait queues — all consensus-replicated
Redis Sidecar	Redis instance	Tier 1 fast path (~1ms) for efficiency locks
Session Manager	In-process	Keep-alive tracking, dead-client detection, batch lock cleanup
Event Streamer	Ring buffer → Kafka	Observability without adding latency to critical path

Request Flow — Step Through

Client Lib→Session→Raft Leader→Followers→State Machine→Client Gets Token→Resource Write→Fence Check

Click Next Step to walk through the request flow.

06

Deep Dive — Consensus & Fencing

The spectrum from "fast and best-effort" to "slow and provably correct":

Level	Approach	Linearizable	Survives Slow Client	True Mutual Exclusion
1	Single Redis SET NX EX	No	No	No
2	Redlock (5 Redis nodes)	No	No	No
3	ZooKeeper / etcd (consensus)	Yes	No	No*
4	Consensus + Fencing Tokens	Yes	Yes	Yes

*Consensus without fencing can be violated by a slow holder that outlives its lease and sends a stale write.

The Redlock Debate (Kleppmann vs Antirez)

Kleppmann's Critique

Redlock assumes bounded timing (GC pauses break validity calculations) and accurate clocks (NTP jumps cause premature expiry). If you add fencing tokens, you don't need Redlock — single Redis + fencing is equally correct.

Our Verdict

Kleppmann is correct on theory. Redlock occupies an awkward middle ground — more infrastructure than single Redis, but not actually linearizable. For correctness: use consensus + fencing. For efficiency: single Redis is simpler.

Fencing Token Flow

sequenceDiagram participant C_A as Client A participant LS as Lock Service (Raft) participant C_B as Client B participant DB as Database C_A->>LS: LOCK("wallet:123", ttl=30s) LS-->>C_A: acquired, token=42 C_A->>DB: WRITE balance=$400, fence_token=42 Note over C_A: GC pause 35 seconds... Note over LS: Session expires at T=30 LS->>LS: Release lock (session expired) C_B->>LS: LOCK("wallet:123", ttl=30s) LS-->>C_B: acquired, token=43 C_B->>DB: WRITE balance=$300, fence_token=43 DB->>DB: 43 > hwm(0) → accept, hwm=43 Note over C_A: Wakes from GC at T=35 C_A->>DB: WRITE balance=$400, fence_token=42 (delayed) DB->>DB: 42 < hwm(43) → REJECTED DB-->>C_A: StaleFenceTokenError

The proof: The database's high-water mark ensures that even if Client A's write was sent while the lock was validly held (at T=0.2), it's rejected when it arrives late (at T=35) because Client B's write (token=43) has already raised the bar. The fencing token at the resource is the actual safety mechanism — not the lock.

Why Consensus Locks Still Need Fencing

Even Raft-backed locks can be violated. The consensus protocol guarantees linearizable lock state (who the lock service thinks holds the lock). But it can't control what happens outside the lock service — a write can be in the TCP buffer when the lock expires. The write arrives at the database after a new holder's write. Without fencing, data corrupts. Consensus + fencing = true correctness. Neither alone is sufficient.

07

All waiters race. Fastest wins. Simpler but starvation risk and thundering herd spike on every release. Bad for multi-tenant fairness.

08

What Can Go Wrong

Lock Holder Dies

gRPC stream breaks → detected in ~100-500ms (TCP close) or up to session TTL (~30s, missed keep-alives). Session expires → all locks release atomically in one consensus entry → next waiters notified. Fencing tokens protect against any writes the dead client sent before crashing.

Slow Client (GC Pause Outlives Lock) — THE Critical Failure

Client A acquires lock (token=42), pauses for 35 seconds. Lock expires at 30s. Client B acquires (token=43). Both operate simultaneously. No lock service can prevent this — the write may be in the TCP buffer when the pause starts. Fencing token at the database is the only defense: B's write sets high-water mark to 43, A's delayed write (token=42 < 43) is rejected.

Network Partition

Client on minority side can't reach lock service. Session eventually expires. Client continues operating, unaware lock is lost. Defense: client library tracks last successful keep-alive — warns application when nearing session TTL. Plus fencing tokens reject stale writes from the partitioned client.

Leader Failover

2-3 second disruption during election. No lock state lost — all committed entries exist on majority. Clients get UNAVAILABLE, retry automatically. Existing locks remain valid (TTLs tracked on all nodes). Fence token counters never go backwards (consensus-replicated).

Clock Skew

Lock TTLs use logical ticks (Raft heartbeat intervals), not wall clocks. All nodes agree on the tick count via consensus. No clock drift affects expiry decisions. Wall clocks used only for logging and external-facing timestamps.

Thundering Herd on Hot Lock Release

200 waiters on a single lock. With broadcast notification: 200 acquire attempts, 199 wasted. With FIFO single-successor notification: only the next waiter is notified. Zero wasted requests. ZooKeeper's watch chain does this naturally.

Deadlock (Circular Wait)

Client A holds lock 1, wants lock 2. Client B holds lock 2, wants lock 1. Circular wait → both stuck. Defense: wait timeouts (breaks deadlock after N seconds), canonical ordering in MultiLock (sort resource IDs, acquire in order — no cycles possible), optional deadlock detection via wait-for graph.

09

Interview Tips

💡

Lead with the central insight.
"A distributed lock by itself does NOT guarantee mutual exclusion. A slow holder can outlive the TTL. True safety requires a lock service plus fencing tokens at the protected resource." This one statement separates you from 95% of candidates.

⚡

Walk through the 5-level progression.
In-process mutex → database row lock → Redis SET NX → Redlock → Consensus + fencing. Each level breaks and motivates the next. Takes 90 seconds but tells a complete story. The design feels derived, not memorized.

🎯

Know the Redlock debate cold.
Kleppmann's critique (bounded timing assumption, clock dependency) vs Antirez's response (add fencing on top). Your verdict: "If you have fencing, you don't need Redlock. If you don't have fencing, Redlock doesn't save you."

🔑

Draw the fencing token flow.
Client A (token=42) pauses. Lock expires. Client B (token=43) writes. Client A's delayed write arrives. Database: 42 < 43 → REJECTED. This concrete visual makes abstract concepts tangible.

🧠

Distinguish Tier 1 from Tier 2.
"The consequence of lock violation determines the design. Cache stampede? Redis is fine. Financial transaction? Consensus + fencing. Most production bugs come from using a Tier 1 lock for a Tier 2 problem."

🎯

Never claim your lock prevents all concurrency issues.
"Locks prevent concurrent access. Database transactions prevent partial updates. You often need both. If the holder crashes mid-transaction, the lock releases but partial writes persist — the transaction rolls back, not the lock."

10

Evolution

Each stage is triggered by a specific failure or scale ceiling. The one exception: fencing tokens should be designed in from the start — cheap to implement, impossible to retrofit safely.

1

In-Process Mutex — 1 server

Language-level lock. Zero infrastructure. Breaks the moment you add a second server — no shared memory across processes.

2

Database Row Lock — 2-10 servers

INSERT with unique constraint or pg_try_advisory_lock. Breaks when holder crashes → lock stuck forever (no TTL).

3

Redis SET NX EX — 2-50 servers, <10K locks/sec

TTL solves stuck locks. ~1ms latency. Breaks on failover (async replication loses locks) and slow holders (TTL outlived). Fine for Tier 1.

4

Consensus (etcd/ZooKeeper) — 10-200 servers

Linearizable lock state. Sessions with keep-alive. FIFO wait queues. Fencing tokens via revision numbers. ~5-15ms. Breaks at 30K+ ops/sec (consensus ceiling).

5

Two-Tier Custom Service — 50-500 services (Target Design)

Tier 1 (Redis ~1ms) + Tier 2 (Raft ~5-7ms). Thick client library. Purpose-built fencing tokens, FIFO queues, session management. The design we built in this document.

6

Sharded Multi-Cluster — 500+ services

Shard lock namespace across N Raft groups (hash(resource) % N). Linear throughput scaling. Per-shard blast radius. Cross-shard MultiLock not supported.

7

Multi-Region Global — 1000+ services, global

Region-local clusters. Global lock coordinator for cross-region resources. Spanner-style TrueTime for global fencing token ordering (aspirational). Data sovereignty compliance.

📺

References & Videos

Distributed Locking with Redis (Redlock)

Hussein Nasser · 18 min

Distributed Locks — System Design

Arpit Bhayani · 22 min

How to do distributed locking

Martin Kleppmann

Distributed Locks with Redis

Redis Docs

Problem Statement

Three Categories of Locking Need

Tier 1 — Efficiency

Tier 2 — Correctness

The Naive Lock & Its Four Failures

Requirements

Functional Requirements

Non-Functional Requirements

Scale Estimation

Key Insight: Consensus-Bound, Not Storage-Bound

API Design

Fencing Token Contract

Session Model

High-Level Architecture

Deep Dive — Consensus & Fencing

The Redlock Debate (Kleppmann vs Antirez)

Kleppmann's Critique

Our Verdict

Fencing Token Flow

Why Consensus Locks Still Need Fencing

Key Design Decisions & Tradeoffs

CP vs AP Lock Service

CP — Consistency over Availability

AP — Availability over Consistency

Session-Based vs TTL-Only Expiry

Session + Heartbeat

Per-Lock TTL (Redis-style)

Consensus Protocol

Raft

Multi-Paxos

Fencing Enforcement Location

Resource-Side Enforcement

Lock-Service-Side Enforcement

FIFO vs Competitive Lock Acquisition

FIFO Wait Queues

Competitive (Race on Release)

What Can Go Wrong

Lock Holder Dies

Slow Client (GC Pause Outlives Lock) — THE Critical Failure

Network Partition

Leader Failover

Clock Skew

Thundering Herd on Hot Lock Release

Deadlock (Circular Wait)

Interview Tips

Similar Problems

Distributed Job Scheduler

Leader Election

Distributed Transactions (2PC)

Optimistic Concurrency Control

Rate Limiter

Distributed Semaphore

Evolution

In-Process Mutex — 1 server

Database Row Lock — 2-10 servers

Redis SET NX EX — 2-50 servers, <10K locks/sec

Consensus (etcd/ZooKeeper) — 10-200 servers

Two-Tier Custom Service — 50-500 services (Target Design)

Sharded Multi-Cluster — 500+ services

Multi-Region Global — 1000+ services, global

References & Videos

Distributed Job Scheduler

Rate Limiter

Consensus — Paxos & Raft