Distributed Cache

An in-memory key-value store serving ~1M ops/sec per node with sub-millisecond latency. The hard parts: hash-slot sharding across a cluster of ~100 nodes so any key routes deterministically, eviction policies (LRU/LFU) that keep hot data in RAM while memory stays bounded, and a replication + persistence layer (leader-follower async replication, RDB snapshots + AOF logs) that survives node crashes without losing writes. Redis, Memcached, Dragonfly -- same core pattern, different tradeoffs.

Core: Hash-Slot Sharding + Eviction + Replication~1M ops/sec per nodeSub-ms p99 reads~100 GB RAM per node16384 hash slots
02

Requirements

Functional
  • SET / GET / DEL for arbitrary key-value pairs with optional TTL
  • Atomic counters: INCR / DECR for rate limiting, counters, leaderboards
  • Key expiration via EXPIRE / TTL with lazy + periodic eviction
  • Hash-slot routing: client maps key to shard via CRC16(key) mod 16384
  • Pub/Sub: PUBLISH / SUBSCRIBE for real-time event fan-out
  • Lua scripting for multi-step atomic operations (e.g., DECR-if-positive)
  • Pipeline / batch mode: send N commands in one round-trip
Non-Functional
  • Sub-millisecond p99 read latency on cache hits
  • ~1M ops/sec per node throughput
  • Cluster scales to ~100 nodes with automatic rebalancing
  • ~100 GB usable RAM per node (leave headroom for fragmentation + fork)
  • Survive single-node failure with < 5 s failover via Sentinel or cluster voting
  • Persistence: zero data loss on graceful shutdown; < 1 s of loss on crash (AOF fsync)
03

Scale Estimation

Throughput per node
~1M ops/sec
single-threaded event loop; I/O threads for parsing in Redis 6+
Cluster size
~100 nodes
16384 hash slots distributed across shards
p99 read latency
< 1 ms
in-memory hash table lookup; network is the bottleneck
RAM per node
~100 GB
leave 30% headroom for copy-on-write fork during RDB save
Replication lag
< 1 ms typical
async streaming; can spike during full resync
Failover time
~5 s
Sentinel/cluster detects failure + promotes follower
04

API Design

CMDSET key value [EX seconds] [NX|XX]

Store a key-value pair. EX sets TTL in seconds. NX = set only if not exists (distributed lock). XX = set only if exists (update). Returns OK or nil.

CMDGET key

Retrieve value by key. Returns the stored string or nil. O(1) hash table lookup. Client routes to correct shard via CRC16(key) mod 16384.

CMDDEL key [key ...]

Remove one or more keys. Returns count of keys deleted. For large keys, use UNLINK for async background deletion.

CMDINCR key / EXPIRE key seconds

INCR: atomic increment, returns new value. EXPIRE: set TTL on existing key. Both O(1). Used for rate limiters, counters, session TTLs.

CMDPUBLISH channel message / SUBSCRIBE channel

Pub/Sub fan-out. PUBLISH sends message to all subscribers of a channel. Fire-and-forget -- no persistence, no replay. Use Streams for durable messaging.

CMDPIPELINE: [GET k1, SET k2 v2, INCR k3, ...]

Batch N commands in one round-trip. Client buffers commands, sends all at once, reads all replies. Cuts network overhead from N round-trips to 1.

05

Architecture

Three layers: smart client (caches hash-slot-to-node map, routes directly to the correct shard), shard cluster (each shard = leader + 1-2 replicas, owns a range of 16384 hash slots), and persistence layer (RDB snapshots + AOF log written to disk from each leader). Sentinel or the cluster's gossip protocol handles automatic failover. No single point of failure when configured with replicas.

Redis Cluster ArchitectureSVG
Applicationcache-aside pattern Smart Clientslot map cachedCRC16(key) % 16384 Shard A (Leader)slots 0-5460 Replica Aasync replication Shard B (Leader)slots 5461-10922 Replica Basync replication Shard C (Leader)slots 10923-16383 Replica Casync replication Sentinel / Gossipfailover + voting RDB Snapshotfork + dump to disk AOF Logappend every write Database (Postgres / MySQL) — source of truth; cache-aside: read cache first, miss = load from DB + SET
Request Flow — Step Through
Client App · cache-aside read/writeSmart Client · CRC16 slot routingShard Leader · in-memory hash tableAOF Log · append every writeReplica · async replicationRDB Snapshot · periodic fork+dumpSentinel/Gossip · failover + health
Click Next Step to walk through the request flow.
06

Deep Dive — Sharding, Eviction, Persistence, Replication

(a) Hash-slot sharding. Redis Cluster divides the keyspace into 16384 slots. Each key maps to a slot via CRC16(key) mod 16384. Each shard (leader node) owns a contiguous range of slots. The client caches the slot-to-node map and routes commands directly -- no proxy needed.

On cluster rebalance (adding/removing nodes), slots migrate between shards. During migration, a client may receive a MOVED redirect (slot permanently moved -- update map) or ASK redirect (slot mid-migration -- retry once at new node). Smart clients handle both transparently.

# Hash slot calculation example
key = "user:42"
slot = CRC16("user:42") % 16384  # = 7218
# Slot 7218 falls in Shard B's range (5461-10922)
# Client routes SET/GET directly to Shard B leader

# Hash tags force related keys to same slot
"order:{user42}:history"  → CRC16("user42") % 16384
"cart:{user42}:items"     → CRC16("user42") % 16384
# Both land on the same shard — enables multi-key Lua scripts

(b) Eviction policies. When maxmemory is hit, Redis must evict keys to make room. Policies:

  • volatile-lru -- evict least-recently-used among keys with a TTL set
  • allkeys-lru -- evict LRU across all keys (most common for cache use)
  • allkeys-lfu -- evict least-frequently-used (better for skewed access patterns)

Redis uses approximated LRU: sample N random keys (default 5), evict the one with the oldest last-access timestamp. Not true LRU (no linked list), but very close in practice and O(1) per eviction. Increasing the sample size to 10 gets within 1% of true LRU accuracy.

# redis.conf eviction settings
maxmemory 80gb
maxmemory-policy allkeys-lru
maxmemory-samples 10

(c) Persistence.

  • RDB = point-in-time snapshot. Redis forks, child writes entire dataset to disk. Fast restore, but you lose all writes since the last snapshot on crash.
  • AOF = append-only file. Every write command appended to log. Three fsync modes: always (safe, slow), everysec (lose ~1 s on crash -- recommended), no (OS decides).
  • RDB+AOF hybrid (Redis 4+) = AOF rewrite starts with an RDB prefix (compact binary) followed by AOF tail of recent writes. Fast restore + minimal data loss. This is the default in Redis 7.

AOF rewrite. Over time, AOF grows large. Redis triggers background rewrite: forks a child process that writes a minimal set of commands to recreate the current dataset. Parent buffers new writes during rewrite, appends them after. Result: compact AOF, no downtime.

(d) Replication. Async leader-to-follower streaming. Leader sends write commands to all replicas after executing locally. On leader failure:

  • Sentinel mode: external Sentinel processes monitor the leader. Quorum agrees leader is down, promotes a follower, updates clients.
  • Cluster mode: replicas initiate a Raft-like voting process. Other leaders vote to elect one replica as new leader. No external process needed.

Full resync vs partial resync. When a replica reconnects after a brief disconnect, Redis attempts partial resync using the replication backlog buffer (a ring buffer of recent writes on the leader). If the disconnect was short and the backlog hasn't wrapped, only the missed writes are sent. If the backlog overflowed, a full resync is triggered: leader forks, creates RDB, streams entire dataset. This is expensive -- configure repl-backlog-size to at least 256 MB in production to minimize full resyncs.

Write Path — SET key valueMermaid
sequenceDiagram participant C as Client participant SC as Smart Client participant L as Shard Leader participant AOF as AOF Log participant R as Replica C->>SC: SET user:42 "data" SC->>SC: CRC16("user:42") mod 16384 = slot 7218 SC->>L: route to Shard B (owns 5461-10922) L->>L: write to in-memory hash table L->>AOF: append SET command L-->>C: OK L->>R: async replicate SET command R->>R: apply to memory

Cache-aside pattern (the most common Redis usage):

# Read path (cache-aside)
value = redis.GET(key)
if value is None:           # cache miss
    value = db.query(key)   # load from DB
    redis.SET(key, value, EX=3600)  # populate cache, 1hr TTL
return value

# Write path (invalidate on write)
db.update(key, new_value)   # write to DB first
redis.DEL(key)              # invalidate cache
# Next read will miss → refill from DB with fresh data

Why DEL on write instead of SET? If you SET the cache on write, a race between two concurrent writers can leave stale data in cache permanently. Consider: Writer A reads DB, Writer B reads DB, Writer B updates DB, Writer B writes cache, Writer A (with stale data) overwrites cache. Now cache is stale forever. DEL is safer: worst case, you get one extra cache miss.

Lua scripting for atomic multi-step operations:

-- Rate limiter: sliding window counter
-- KEYS[1] = rate limit key, ARGV[1] = window (sec), ARGV[2] = max requests
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if current > tonumber(ARGV[2]) then
    return 0  -- rate limited
end
return 1  -- allowed

Lua scripts execute atomically on a single shard. No other command runs between lines. This is how Redis replaces what would otherwise require distributed locks or database transactions.

Interview answer

"Redis Cluster shards the keyspace into 16384 hash slots using CRC16. Each shard is a leader-replica pair. The smart client caches the slot map and routes directly -- no proxy overhead. Writes go to the leader, get appended to AOF, and async-replicated to followers. On leader failure, cluster voting promotes a replica in ~5 seconds. Eviction uses approximated LRU: sample 5 keys, evict the stalest. Persistence is RDB+AOF hybrid: RDB prefix for fast restore, AOF tail for minimal data loss."

Anti-patterns

1
Use Redis as the primary database -- "it's fast, who needs Postgres?"

Without AOF fsync=always, a crash loses recent writes. Even with AOF, no ACID transactions, no relational queries. Redis is a cache and data structure server, not a durable primary store.

Better: Cache-aside pattern: DB is source of truth, Redis accelerates reads. Write to DB first, invalidate/update cache.
2
Store 500 GB in a single Redis instance -- "just add more RAM"

RDB fork doubles memory (copy-on-write pages). 500 GB dataset = needs ~1 TB physical RAM. Fork takes minutes. Replication full-sync takes hours. One crash = massive recovery time.

Better: Shard across multiple nodes. Each node holds ~50-100 GB. Fork is fast, replication resync is quick.
3
Run KEYS * in production to find matching keys

KEYS blocks the single-threaded event loop. On a dataset with 10M keys, this blocks all clients for seconds. Latency spike, cascading timeouts, potential outage.

Better: Use SCAN with cursor-based iteration. Non-blocking, returns results incrementally. Also avoid DEBUG SLEEP, FLUSHALL without ASYNC, and large sorted-set operations on hot paths.
4
Cache with no TTL -- "we'll invalidate manually when data changes"

You forget one invalidation path. Stale data lives forever. User sees their old profile photo for weeks. Support tickets pile up.

Better: Always set a TTL as a safety net, even with active invalidation. TTL = upper bound on staleness. Active invalidation = best-effort freshness.
07

Tradeoffs & Design Choices

Every cache design decision is a tradeoff between latency, consistency, durability, and operational complexity. Here are the key ones:

  • RDB vs AOF vs Hybrid. RDB: fast restores, periodic data loss. AOF: minimal loss, slower restores (replay entire log). Hybrid: best of both -- RDB prefix for speed, AOF tail for safety. Use hybrid in production.
  • Cluster mode vs Sentinel. Sentinel: simpler, single logical instance, no sharding. Good to ~100 GB. Cluster: built-in sharding + failover, scales to TB. Choose cluster when data exceeds single-node RAM.
  • Cache-aside vs Write-through. Cache-aside: app manages cache (read: check cache, miss = load DB + SET; write: update DB + invalidate cache). Write-through: cache layer intercepts all writes to DB. Cache-aside is simpler and more common; write-through has fewer stale reads but adds write latency.
  • Redis vs Memcached. Memcached: multi-threaded, simple KV, no persistence, no pub/sub. Redis: single-threaded (I/O threads in 6+), rich data structures, persistence, pub/sub, Lua. Memcached wins on raw multi-core GET throughput; Redis wins on everything else.
  • Approximated LRU vs true LRU. True LRU requires a doubly-linked list (memory overhead per key). Redis samples N keys and evicts the worst. With N=10, accuracy is within 1% of true LRU. Tradeoff: tiny accuracy loss for massive memory savings.
  • Single-threaded vs multi-threaded. Redis is single-threaded for command execution (no locks, no races). Redis 6+ adds I/O threads for network read/write parsing, keeping the core single-threaded. Dragonfly goes fully multi-threaded with shared-nothing per-core architecture -- higher throughput, more complexity.
  • Proxy vs smart client. Proxy (Twemproxy, Envoy): app connects to one endpoint, proxy routes. Simpler client, extra network hop (~0.2 ms). Smart client (Jedis, Lettuce): app caches slot map, routes directly. No extra hop, but client must handle MOVED/ASK. Production Redis Cluster uses smart clients for lowest latency.
  • TTL-based expiration vs explicit invalidation. TTL: simple, eventually consistent, bounded staleness. Explicit invalidation: immediate freshness, but requires tracking all write paths. Best practice: use both -- active invalidation for correctness, TTL as safety net.
08

Failure Modes

Caches fail in subtle ways. Unlike databases, a cache failure doesn't always throw an error -- it degrades silently (stale data, increased latency, DB overload). Understanding these modes is essential.

1
Leader crash -- failover lag causes stale reads
Leader dies. During ~5 s failover, reads from replica may return stale data. Writes are rejected until new leader is elected.
Mitigation: configure WAIT command for critical writes (sync replication to N replicas). Accept eventual consistency for non-critical reads. Client retries on READONLY error.
2
Split-brain during network partition
Network splits cluster. Both sides elect a leader for the same slots. Clients write to both. On partition heal, one side's writes are lost.
Mitigation: configure min-replicas-to-write=1. Leader refuses writes if it can't reach any replica. Prevents divergent writes at cost of availability.
3
Hot key -- single shard overwhelmed
A viral cache key (e.g., celebrity profile) routes to one shard. That shard saturates at 1M ops/sec while others idle.
Mitigation: client-side local cache (Redis 6 tracking mode). Read replicas for hot keys. Key hashing with "{tag}" to spread related keys, but hot single keys need app-level handling.
4
Thundering herd on cache miss
Popular key expires. 10K requests simultaneously miss cache, all hit the DB, DB overloads.
Mitigation: lock-based cache fill (one request fills, others wait). Probabilistic early expiration (refresh before TTL). Background refresh for hot keys.
5
Memory fragmentation after long uptime
Repeated alloc/free of variable-size values fragments jemalloc. RSS grows while used memory stays flat. OOM killer strikes.
Mitigation: Redis 4+ active defragmentation (ACTIVE-DEFRAG). Monitor fragmentation ratio (INFO memory). Schedule restarts if ratio exceeds 1.5.
6
Full resync storm after network blip
Brief network partition between leader and all replicas. Replication buffer overflows. All replicas need full resync simultaneously -- leader forks N times, OOM.
Mitigation: increase repl-backlog-size (default 1 MB is too small for production). Stagger replica reconnection. Monitor repl_backlog_active.
09

Interview Tips

  1. Start with the access pattern. "Read-heavy workload, cache-aside pattern, Redis as L2 between app and DB." This frames the problem correctly.
  2. Name the sharding scheme. "CRC16 mod 16384 hash slots, client-side routing, MOVED/ASK redirects." Shows you know Redis internals, not just the API.
  3. Explain eviction before they ask. "allkeys-lru with sample size 10. Approximated LRU, not true LRU -- the O(1) tradeoff." This is the detail that separates senior from mid-level.
  4. Persistence is not optional. "RDB+AOF hybrid. RDB prefix for fast restart, AOF tail for sub-second data loss." Don't hand-wave "we'll use persistence" -- name the mode.
  5. Address the thundering herd. Without prompting, mention cache stampede mitigation: lock-based fill, probabilistic early expiration, or background refresh. Shows production experience.
  6. Know when Redis is wrong. "Above 100 GB per node, consider sharding more aggressively or using a disk-backed store. Redis is not a database replacement." Showing boundaries earns trust.
11

Evolution

How cache architecture evolves as scale grows from a single app server to millions of requests per second across a global fleet.

1

Single Memcached instance

Simple KV cache. No persistence, no replication. Works until the node dies and your cache goes cold. Restart = cache stampede on DB. Fine for prototypes and small apps with < 10K RPM.

2

Redis standalone + Sentinel

Persistence (RDB/AOF) survives restarts. Sentinel auto-promotes replica on leader failure -- no manual intervention. Rich data structures (sorted sets, hashes, streams) enable complex patterns beyond simple KV. Still limited to one node's RAM (~100 GB practical max).

3

Redis Cluster with hash slots

16384 slots sharded across N leaders with replicas. Smart clients route directly using cached slot map. Cluster gossip protocol for health + Raft-like voting for failover. Scales linearly to TB of data and millions of ops/sec. This is the standard for mid-to-large companies.

4

Client-side caching (Redis 6 tracking)

Client caches hot keys locally in process memory. Redis server tracks which clients cached which keys. On key modification, server sends invalidation message to all tracking clients. Eliminates network round-trip for hot reads entirely. L1 = local process memory (~1 ms savings), L2 = Redis.

5

Tiered cache: L1 local + L2 Redis + L3 DB

Process-local cache (Caffeine/Guava) for ultra-hot keys. Redis cluster for warm keys. Database for cold reads. Each tier absorbs misses from the tier above. Total hit rate approaches 99.9%. Invalidation via Redis pub/sub or Redis 6 client tracking notifications.

Next up