Mock Interview: Design a Distributed Cache — Mock Transcript — Mock Interview

Interviewer

Design a distributed cache system — something like Redis or Memcached, but from scratch. It needs to handle 10 million reads per second with sub-millisecond latency. Walk me through your design.

Candidate

Let me start with the data model and partitioning strategy. We're building a key-value store optimized for read-heavy workloads. To distribute 10M reads/sec across multiple nodes, I'd use hash-slot sharding — similar to Redis Cluster's approach. We divide the keyspace into 16,384 hash slots. Each key is mapped to a slot via CRC16(key) mod 16384. Slots are distributed across N shards. With 16 shards, each handles about 1,024 slots and roughly 625K reads/sec — well within what a single instance can handle with in-memory data structures. The slot-to-shard mapping is stored in a cluster metadata table that every client caches locally. This gives us O(1) routing — the client computes the slot, looks up the shard, and sends the request directly. No proxy layer needed for the hot path.

📝 Annotation

Starting with hash-slot sharding and citing the specific 16,384 number shows Redis Cluster knowledge. Computing the math (625K reads per shard) and explaining client-side routing demonstrates the candidate can reason about capacity from first principles.

Interviewer

What eviction policies would you support, and how do you implement them efficiently?

Candidate

I'd support three eviction policies: LRU (least recently used), LFU (least frequently used), and TTL-based expiration. For LRU, maintaining a perfect LRU across millions of keys is expensive — a doubly-linked list with O(1) operations per access adds 16 bytes of pointer overhead per key and requires a lock on every read. Instead, I'd use an approximated LRU like Redis does: on eviction, sample 5 random keys and evict the one with the oldest access timestamp. This is surprisingly close to true LRU in practice — Redis benchmarks show it achieves 95% of perfect LRU's hit rate. For LFU, I'd use a logarithmic frequency counter (8 bits) combined with a decay factor — the counter probabilistically increments on access and decays over time, so recently-popular keys are favored over historically-popular-but-now-cold keys. TTL expiration uses a combination of lazy deletion (check TTL on access) and an active expiration loop that samples 20 keys with TTL set every 100ms, deleting expired ones, and repeating if more than 25% were expired.

📝 Annotation

Explaining why exact LRU is impractical and pivoting to sampled approximation shows depth. The Redis-style active+lazy TTL expiration with specific sampling parameters (20 keys, 100ms, 25% threshold) demonstrates internals knowledge beyond textbook answers.

Interviewer

How do you handle persistence? Should a cache be persistent?

Candidate

This is a key trade-off. A pure cache doesn't need persistence — if the node restarts, it warms up from the backing store. But in practice, cold-start cache misses can overwhelm the database (thundering herd on restart). So I'd offer two persistence modes. First, RDB snapshots: periodically fork the process, and the child writes a point-in-time snapshot to disk using copy-on-write memory pages. This is efficient — the parent keeps serving reads while the child serializes. Snapshots happen every 5 minutes or after N writes, whichever comes first. The trade-off is you lose up to 5 minutes of data on crash. Second, AOF (append-only file): every write operation is logged to disk. You can tune fsync frequency — every write (safest, but 10x slower), every second (good balance), or never (OS decides). AOF gives better durability but slower writes and larger disk usage. I'd default to RDB for cache use cases and offer AOF for scenarios where the cache doubles as a primary data store, like session storage.

Interviewer

Let's talk about replication. How do you keep replicas in sync?

Candidate

Each shard has one leader and one or two replicas. Replication is asynchronous by default — the leader applies the write, acknowledges the client, then streams the operation to replicas via a replication log. Replicas apply operations in order, maintaining a replication offset that the leader tracks. For initial sync (when a new replica joins or falls too far behind), the leader creates an RDB snapshot, sends it to the replica, then streams all buffered operations since the snapshot began. The replication log is a bounded in-memory buffer (say 64MB) — if a replica disconnects for too long and falls outside this buffer, it needs a full resync. Reads can optionally go to replicas for higher throughput, but clients must tolerate stale reads. For strong consistency needs, we can support a WAIT command: the client blocks until N replicas have acknowledged the write, trading latency for consistency.

📝 Annotation

The bounded replication buffer with full-resync fallback is exactly how Redis replication works. Offering WAIT as an opt-in consistency mechanism shows the candidate understands the consistency-latency spectrum rather than forcing one model.

Interviewer

Here's the hard question: what happens during leader failover? Walk me through the failure scenario and its implications.

Candidate

Leader failover is where things get tricky because of the async replication. Here's the timeline: the leader crashes, and the Sentinel or gossip-based failure detector notices after 3-5 seconds (configurable timeout). The replicas elect a new leader — we use a Raft-lite protocol where the replica with the highest replication offset wins. The new leader starts accepting writes. But here's the critical caveat: since replication was async, the old leader may have acknowledged writes that never reached any replica. Those writes are permanently lost. This is the fundamental trade-off of async replication: we chose availability and low latency over consistency. To mitigate, we can do two things. First, use the WAIT command for writes that cannot be lost — the client only considers the write successful after at least one replica ACKs. Second, when the old leader comes back online, it has divergent data — writes it accepted that the new leader doesn't have. We detect this by comparing replication offsets and truncate the old leader's divergent suffix before it rejoins as a replica.

📝 Annotation

Explicitly calling out data loss during async failover — and not hand-waving it — is what separates strong candidates. The truncation of divergent history when the old leader rejoins is a subtle detail that shows deep distributed systems understanding.

Interviewer

How do you handle cluster resharding — adding or removing nodes?

Candidate

When we add a new shard, we need to migrate some hash slots to it. The process is online — no downtime. We mark slots as "migrating" on the source and "importing" on the destination. During migration, the source checks each key on access: if the key's slot is migrating and the key still exists locally, serve it; if it's already been moved, return a MOVED redirect to the client. A background process iterates through keys in the migrating slots, serializing and sending them to the destination. Once all keys are moved, we update the cluster metadata atomically (the slot-to-shard mapping), and clients refresh their cached routing table on the next MOVED response. The entire process is incremental and doesn't block normal operations. We rate-limit the migration to avoid overwhelming the network — typically 64 keys per batch with a 1ms pause between batches.

Interviewer

One more: how do you prevent cache stampede?

Candidate

Cache stampede happens when a hot key expires and hundreds of concurrent requests all miss the cache simultaneously, overwhelming the backend. Three defenses. First, probabilistic early expiration: instead of expiring at exactly TTL, each access computes now - (TTL - random_delta) > expiry — a small percentage of requests will "expire early" and refresh the cache before the actual TTL, ensuring the key is almost never truly cold. Second, locking: when a cache miss occurs, the first request acquires a distributed lock (SET key:lock NX EX 5) and fetches from the backend; all other requests wait briefly and retry, hitting the now-populated cache. Third, stale-while-revalidate: serve the expired value while one request refreshes it in the background. I'd implement all three and let users choose based on their consistency tolerance.

📝 Annotation

Three distinct stampede prevention strategies with implementation details (probabilistic early expiration, distributed lock, stale-while-revalidate) is comprehensive. Most candidates only mention one of these.

Mock Interview: Design a Distributed Cache — Mock Transcript

Problem statement

Transcript

Key takeaways