Key-Value Store

10 million requests per second, 100 TB of data, single-digit millisecond p99 at any scale. The hard parts: a consistent hash ring that routes every request to the right partition without a central coordinator, an LSM storage engine (memtable + WAL + SSTables + compaction) that turns random writes into sequential I/O, and quorum replication across three AZs so a single node or zone death never loses data and never causes downtime. Amazon DynamoDB, Apache Cassandra, Riak -- same lineage from the 2007 Dynamo paper.

Core: Consistent Hashing + LSM Engine + Quorum Replication~10M req/sec~100 TB dataSingle-digit ms p99Auto-scaling partitions

Requirements

A key-value store must support point lookups by primary key in constant time, range queries within a partition, and horizontal scaling by adding partitions as data grows. DynamoDB extends the basic model with transactions, streams, and global replication.

Functional

PutItem / GetItem / DeleteItem on a single partition key + optional sort key
Query by partition key with sort key range conditions (between, begins_with)
Scan full table with filter expressions (for admin/analytics, not hot path)
BatchWriteItem for bulk ingestion (up to 25 items per call)
TransactWriteItems for ACID transactions across up to 25 items
DynamoDB Streams (CDC) -- ordered changelog per partition for downstream consumers
Global tables -- multi-region active-active replication

Non-Functional

Single-digit millisecond p99 latency for reads and writes
Scale to ~10M requests/sec globally without pre-provisioning
Durability: no data loss on single-AZ failure (3-AZ replication)
Auto-scaling: split hot partitions, merge cold ones, zero downtime
Adaptive capacity: borrow unused throughput from cold to hot partitions
On-demand mode: zero capacity planning, pay per request

Scale Estimation

DynamoDB handles trillions of requests per day across all AWS customers. For a single large table, these are representative numbers. The key insight: scale is achieved by adding partitions, not by making individual nodes faster.

Global throughput

~10M req/sec

across all tables, all regions combined

Total data

~100 TB

per table; some tables exceed 1 PB

p99 latency

< 10 ms

single-digit ms for Get/Put; consistent at any scale

Partitions

~1000s

per table; each handles ~1K WCU / 3K RCU

Item size limit

400 KB

per item; larger objects stored in S3 with pointer

Replication factor

leader + 2 followers across 3 AZs

API Design

DynamoDB's API is intentionally narrow -- no arbitrary SQL, no joins, no aggregations. Every operation targets a specific partition key, which guarantees the router can send the request to exactly one partition in O(1). This constraint is what enables single-digit ms latency at any scale. The exception is Scan, which touches every partition and should never be used on the hot path.

PUTPutItem(TableName, Item{pk, sk, ...attrs})

Insert or overwrite a single item. Supports ConditionExpression for optimistic locking (e.g., attribute_not_exists(pk)). Returns consumed capacity units.

GETGetItem(TableName, Key{pk, sk}, ConsistentRead?)

Fetch a single item by exact primary key. ConsistentRead=true reads from leader; default eventual read from any replica. Sub-ms for cached items.

DELETEDeleteItem(TableName, Key{pk, sk})

Delete a single item. Supports ConditionExpression to prevent deleting if item has changed. Returns old item if ReturnValues=ALL_OLD.

GETQuery(TableName, KeyCondition: pk = X AND sk BETWEEN a AND b)

Fetch multiple items sharing the same partition key, filtered by sort key range. Returns items sorted by sort key. Paginated via LastEvaluatedKey.

POSTBatchWriteItem(RequestItems: {Table: [{PutRequest|DeleteRequest}, ...]})

Bulk write up to 25 items across tables. Not transactional -- partial failures returned in UnprocessedItems. Client retries with exponential backoff.

POSTTransactWriteItems(TransactItems: [{Put|Delete|Update|ConditionCheck}, ...])

ACID transaction across up to 25 items (can span tables). All-or-nothing. Uses optimistic concurrency control -- aborts on conflict. 2x cost of non-transactional writes.

GETScan(TableName, FilterExpression?)

Full table scan with optional filter. Reads every item, then filters. Expensive -- consumes full table RCU. Use for analytics export, not hot path.

Architecture

Three layers: request router (stateless fleet, caches partition map), partition map (consistent hash ring mapping key ranges to virtual nodes to physical storage nodes), storage nodes (leader + 2 replicas each running an LSM tree engine). The request router is the only component the client talks to -- it abstracts the entire distributed system behind a simple API.

A separate control plane runs alongside: the auto-admin service monitors throughput and triggers partition split/merge operations; the global table replicator reads from DynamoDB Streams and writes to remote regions; CloudWatch collects per-partition metrics for adaptive capacity decisions.

Key-Value Store ArchitectureSVG

Request Flow — Step Through

Client SDK · PutItem / GetItem→Request Router · hash + route→Partition Map · consistent hash ring→Leader Node · WAL + memtable→Follower AZ-2 · replica→Follower AZ-3 · replica→LSM Engine · SSTables + compaction

Click Next Step to walk through the request flow.

Deep Dive -- Request Routing + LSM Engine + Replication

(a) Request Routing. The client SDK sends a request to a stateless request router (a fleet of lightweight proxies behind a load balancer). The router computes hash(partition_key) -- typically MD5 or a similar uniform hash -- and looks up the consistent hash ring to find the virtual node (vnode) that owns that key range. Each physical storage node owns multiple vnodes (typically 100-200), which ensures even distribution even when node counts change. The router maps the vnode to a physical node's address and forwards the request.

The router caches the partition map in memory for fast lookups. If the map is stale (partition was split, moved, or a new leader was elected), the storage node returns a MOVED-like redirect containing the new owner's address. The router updates its local cache and retries the request transparently -- the client never sees the redirect. This is similar to Redis Cluster's MOVED/ASK protocol.

(b) LSM Storage Engine. Every write goes to two places simultaneously: an append-only Write-Ahead Log (WAL) on disk for durability, and an in-memory sorted buffer called the memtable (typically a red-black tree or skip list). When the memtable reaches a size threshold (~64 MB), it is flushed as an immutable SSTable (Sorted String Table) to disk. Background compaction merges multiple SSTables to reclaim space and reduce read amplification. Each SSTable has a Bloom filter -- a probabilistic data structure that tells you "definitely not here" in O(1), avoiding unnecessary disk reads.

-- Write path (simplified):
1. Append to WAL (sequential disk write, ~0.1 ms)
2. Insert into memtable (in-memory, ~0.01 ms)
3. Ack to client (total: ~0.1 ms)
-- Background:
4. When memtable full -> flush to SSTable on disk
5. Compaction merges SSTables (leveled or size-tiered)

-- Read path (simplified):
1. Check memtable (in-memory, ~0.01 ms)
2. If not found, check each SSTable level:
   a. Query Bloom filter -> "definitely not here" (skip) or "maybe here"
   b. If maybe here, binary search SSTable index -> read data block
3. Return first match (newest version wins)
-- Bloom filter: ~10 bits per key, < 1% false positive rate
-- Typical read touches 1-2 SSTables thanks to Bloom filters

(c) Replication. Each partition has a leader and 2 followers, each in a different Availability Zone. All writes are directed to the leader, which writes locally and then replicates to both followers. A write is acknowledged when 2 of 3 nodes (quorum) confirm -- this ensures durability even if one entire AZ fails. The math: W=2, R=2, N=3. Since W+R > N (4 > 3), any read quorum overlaps with the write quorum, guaranteeing the latest value is always seen by strong reads.

Strong-consistency reads are served by the leader only -- guaranteed to return the most recent write. Eventually-consistent reads (the default, 50% cheaper in RCU) can be served by any replica. Replication lag is typically under 100 ms but can spike during leader elections or network partitions.

(d) Auto-Scaling. The auto-admin service monitors per-partition throughput via CloudWatch metrics. When a partition exceeds its throughput limit, it is split into two partitions (key range bisected). Cold partitions below a threshold are merged. Adaptive capacity goes further: if partition A uses only 200 of its 1000 WCU, and partition B is throttled at 1000 WCU, the system borrows 800 WCU from A and gives it to B -- no user action needed.

(e) DynamoDB Streams (CDC). Every write to the base table produces a time-ordered stream record containing the old and new item images. Stream records are organized by partition and retained for 24 hours. Consumers (Lambda, Kinesis Data Streams) process records in order per partition key. This enables: materialized view updates, cross-region replication for global tables, event-driven architectures, and real-time analytics pipelines -- all without polling the table.

(f) Global Tables. Multi-region active-active replication. Each region has a full replica of the table. Writes in any region are asynchronously replicated to all other regions via DynamoDB Streams, typically within 1 second. Last-writer-wins conflict resolution using timestamps. Enables low-latency reads in any region and disaster recovery with RPO near zero.

PutItem Write PathMermaid

sequenceDiagram participant C as Client participant R as Request Router participant PM as Partition Map participant L as Leader Node participant F1 as Follower AZ-2 participant F2 as Follower AZ-3 C->>R: PutItem(pk, sk, attrs) R->>PM: hash(pk) -> vnode lookup PM-->>R: storage_node_leader R->>L: forward write L->>L: append WAL + insert memtable par replicate L->>F1: replicate write L->>F2: replicate write end F1-->>L: ack L-->>R: quorum ack (2/3) R-->>C: 200 OK (1.2 ms)

-- Consistent hash ring example:
-- 256 virtual nodes on a ring [0, 2^128)
-- hash("user_123") = 0x4A3F... -> falls between vnode 47 and vnode 48
-- vnode 48 owned by storage_node_7 (leader)
-- Router forwards PutItem to storage_node_7
-- If node_7 was split, it returns MOVED -> storage_node_12
-- Router updates cache, retries to node_12 transparently

-- Bloom filter per SSTable:
-- Insert: hash item key with k hash functions, set k bits
-- Lookup: check k bits. If ANY bit = 0 -> key NOT in this SSTable (skip)
-- If all bits = 1 -> key MIGHT be here (read the SSTable)
-- False positive rate ~1% with 10 bits/key, 7 hash functions
-- Saves ~99% of unnecessary SSTable reads

Interview answer

"Client calls PutItem. The request router hashes the partition key, consults a cached consistent hash ring to find the leader storage node. The leader appends to WAL and inserts into an in-memory memtable -- both under 0.1 ms. It replicates to 2 followers across AZs. Once 2 of 3 ack (quorum), the write is durable and the client gets a response in single-digit milliseconds. Background compaction flushes memtables to SSTables and merges them. Reads check the memtable first, then SSTables with Bloom filters to skip irrelevant files. Hot partitions are auto-split; cold ones merged. Adaptive capacity borrows unused throughput across partitions."

Anti-patterns

Most DynamoDB performance problems come from three mistakes: bad partition key design, using the wrong access pattern, or treating it like a relational database.

Scan entire table for analytics queries

Scan reads every item, consumes full table RCU, and takes minutes on large tables. You are paying for a key-value store to do a full table scan. Production traffic gets throttled while the scan runs.

Better: Enable DynamoDB Streams and pipe to Kinesis/Lambda for real-time analytics. Use the native export-to-S3 feature for batch analytics -- it reads from snapshots, not live partitions. Query the S3 export with Athena or Spark.

Using a date or timestamp as partition key

All writes for "today" hit the same partition. You get a hot partition that throttles while thousands of other partitions sit idle. Classic hotspot pattern that adaptive capacity can only partially mitigate.

Better: Use a high-cardinality partition key (user_id, device_id, order_id). If date is needed, add a random suffix (write sharding): 2024-01-15#03 spreads writes across N shards. Fan out on write, scatter-gather on read.

Storing items larger than 400 KB inline

DynamoDB has a hard 400 KB item size limit. Even below that, large items waste RCU/WCU -- billing is per 4 KB for reads and 1 KB for writes. A 400 KB read costs 100 RCU vs 1 RCU for a 4 KB item. It also bloats replication traffic across AZs.

Better: Store the large payload (images, documents, JSON blobs) in S3. Keep a small DynamoDB item with metadata and an S3 URI pointer. Read metadata from DynamoDB (1 RCU), fetch the blob from S3 separately.

Tradeoffs & Design Choices

Every design decision in a key-value store is a tradeoff between latency, throughput, cost, and consistency. Here are the key choices and when each side wins.

Provisioned vs on-demand capacity. Provisioned: cheaper at steady load, requires capacity planning. On-demand: 6x more expensive per request, but zero planning -- scales instantly. Use provisioned for predictable workloads, on-demand for spiky or new tables.
Strong vs eventual consistency reads. Strong reads always go to the leader -- guaranteed latest, but higher latency and 2x RCU cost. Eventual reads can hit any replica -- lower latency, half the cost, but might return stale data (typically < 1 second behind). Default to eventual; use strong only when correctness requires it.
Single-table vs multi-table design. Single-table: all entity types in one table using overloaded sort keys (e.g., pk=USER#123, sk=ORDER#456). Enables efficient queries across entity types in one call. Multi-table: simpler schema, easier to reason about, but requires joins at the application layer. Single-table wins for read-heavy access patterns; multi-table wins for simplicity.
LSM tree (write-optimized) vs B-tree (read-optimized). LSM: writes are sequential (append WAL + memtable), ~10x faster writes, but reads may check multiple SSTables (read amplification). B-tree: reads are O(log N) from a single structure, but writes require random I/O to update in-place. DynamoDB chose LSM because key-value workloads are often write-heavy and sequential I/O is dramatically faster on SSDs.
Partition key design: composite vs simple. Simple PK (just partition key): each item is unique by PK alone. Composite PK (partition key + sort key): enables Query operations over a range of sort keys within one partition. Use composite when you need to model 1-to-many relationships (e.g., user orders: pk=USER#123, sk=ORDER#timestamp).

Failure Modes

DynamoDB is designed to degrade gracefully -- throttle rather than crash, redirect rather than error. But every distributed system has failure modes that the client must handle.

Hot partition throttling

One partition key receives disproportionate traffic (e.g., viral celebrity profile). Partition exceeds its throughput allocation and requests get throttled with ProvisionedThroughputExceededException.

Mitigation: adaptive capacity borrows from cold partitions. Write sharding for known hot keys. On-demand mode auto-scales per partition. SDK retries with exponential backoff.

Leader failure during write

Leader node crashes after replicating to one follower but before responding to the client. Client sees a timeout.

Mitigation: quorum was achieved (2/3 including the follower), so data is durable. A new leader is elected from followers. Client retries idempotently (condition expressions prevent duplicates). No data loss.

GSI propagation lag

Global Secondary Indexes are updated asynchronously. Under load, GSI can be seconds behind the base table. Query on GSI returns stale results.

Mitigation: GSI lag is inherent (async by design). For strong consistency, query the base table. Monitor GSI replication lag via CloudWatch. Over-provision GSI write capacity to keep up.

Partition split during peak traffic

Auto-admin decides to split a hot partition while it is under heavy load. Brief period of elevated latency and possible throttling during the split window as the partition map is updated and data is redistributed.

Mitigation: split is primarily a metadata operation (update partition map, create new partition with copied key range). Reads/writes continue during split with brief redirects via MOVED responses. For known traffic spikes (product launches, Black Friday), pre-split partitions in advance using provisioned capacity increases.

Region-level outage with global tables

An entire AWS region goes offline. Global table replicas in that region become unreachable. Writes that were in-flight to the failed region may not have replicated to other regions yet (async replication).

Mitigation: Route53 health checks detect regional failure and redirect traffic to healthy regions. Global table replicas in other regions continue serving reads and writes independently. RPO is typically < 1 second for async replication. On region recovery, DynamoDB Streams automatically catches up any missed writes.

Interview Tips

In a system design interview, the key-value store question tests your understanding of distributed systems fundamentals. Focus on these points to demonstrate depth.

Start with consistent hashing. "hash(partition_key) maps to a virtual node on a consistent hash ring. Adding/removing nodes only remaps 1/N of keys." This is the foundation.
Name the LSM tree explicitly. "Writes append to WAL + memtable, flush to SSTables, background compaction merges them. Bloom filters avoid unnecessary reads." Shows you understand the storage engine, not just the API.
Quorum math: W + R > N. "3 replicas, write quorum of 2, read quorum of 2. W+R=4 > 3, so any read sees the latest write. For eventual consistency, R=1 is enough."
Partition key design is the interview. High cardinality, uniform distribution, matches the access pattern. Bad PK = hot partition = system down. Give a concrete example: user_id good, date bad.
Mention adaptive capacity. This is the detail that shows you have gone beyond textbook knowledge. "DynamoDB borrows unused capacity from cold partitions to hot ones -- no manual intervention."
Contrast with the original Dynamo paper. "Dynamo used vector clocks and sloppy quorums for AP. DynamoDB uses a single leader per partition for CP within a region, which is a simpler model. Global tables add eventual consistency across regions."
Discuss single-table design. "Overload the sort key to store multiple entity types in one table. PK=USER#123 SK=PROFILE for metadata, SK=ORDER#2024-01-15 for orders. One Query returns all data for a user. This is the DynamoDB-native pattern."

Evolution

The evolution from a single-node hash table to a globally distributed managed service mirrors the history of distributed systems research -- each step added a capability and introduced new complexity.

Single-node hash table

In-memory hash map on one server. O(1) lookups, fast but limited by RAM (typically < 64 GB). No durability -- data lost on process crash. No replication -- single point of failure. Sufficient for caching, but not for a database. This is where memcached lives.

Dynamo paper (2007) -- consistent hashing + vector clocks

Amazon's internal Dynamo system. Consistent hashing with virtual nodes distributes keys across a ring of physical nodes. Vector clocks track causality and detect conflicts from concurrent writes. Sloppy quorum + hinted handoff prioritize availability over consistency (AP in CAP). Merkle trees detect replica divergence for anti-entropy repair.

DynamoDB -- managed, auto-partitioned

Fully managed service launched in 2012. Automatic partition splitting based on size (10 GB) and throughput (1K WCU / 3K RCU). Dropped vector clocks -- single leader per partition with Paxos-based leader election simplifies consistency model. Provisioned capacity with manual scaling.

Adaptive capacity + on-demand mode (2018)

Adaptive capacity intelligently redistributes unused throughput from cold partitions to hot ones in real time. On-demand mode (pay-per-request) eliminates capacity planning entirely -- table scales from zero to millions of requests per second instantly. Auto-scaling with target tracking for provisioned mode.

Global tables + PITR + export-to-S3

Multi-region active-active replication via global tables with last-writer-wins conflict resolution. Point-in-time recovery (continuous backups, restore to any second in the last 35 days). Native export to S3 for analytics -- reads from snapshots, zero impact on live table throughput. PartiQL support for SQL-like queries.

📺

References & Videos

Design a Key-Value Store

ByteByteGo · 12 min

Amazon DynamoDB Deep Dive

AWS re:Invent · 45 min

Design a Key-Value Store — LSM Trees + Compaction

Jordan Has No Life · 35 min

Dynamo: Amazon's Key-Value Store

Werner Vogels

Design a Key-Value Store

AlgoMaster

Next up

PROBLEM

S3 Object Storage

Blob storage with eventual consistency -- different data model, similar partitioning

Read →

PROBLEM

Distributed Cache

In-memory key-value (Redis/Memcached) -- same hashing, no durability

Read →

Key-Value Store

Requirements

Scale Estimation

API Design

Architecture

Deep Dive -- Request Routing + LSM Engine + Replication

Anti-patterns

Tradeoffs & Design Choices

Failure Modes

Interview Tips

Similar Problems

S3 Object Storage

Distributed Cache

Distributed Locking

Sharding (Concept)

Consistent Hashing (Concept)

Evolution

Single-node hash table

Dynamo paper (2007) -- consistent hashing + vector clocks

DynamoDB -- managed, auto-partitioned

Adaptive capacity + on-demand mode (2018)

Global tables + PITR + export-to-S3

References & Videos

S3 Object Storage

Distributed Cache