Key-Value Store

10 million requests per second, 100 TB of data, single-digit millisecond p99 at any scale. The hard parts: a consistent hash ring that routes every request to the right partition without a central coordinator, an LSM storage engine (memtable + WAL + SSTables + compaction) that turns random writes into sequential I/O, and quorum replication across three AZs so a single node or zone death never loses data and never causes downtime. Amazon DynamoDB, Apache Cassandra, Riak -- same lineage from the 2007 Dynamo paper.

Core: Consistent Hashing + LSM Engine + Quorum Replication~10M req/sec~100 TB dataSingle-digit ms p99Auto-scaling partitions
02

Requirements

A key-value store must support point lookups by primary key in constant time, range queries within a partition, and horizontal scaling by adding partitions as data grows. DynamoDB extends the basic model with transactions, streams, and global replication.

Functional
  • PutItem / GetItem / DeleteItem on a single partition key + optional sort key
  • Query by partition key with sort key range conditions (between, begins_with)
  • Scan full table with filter expressions (for admin/analytics, not hot path)
  • BatchWriteItem for bulk ingestion (up to 25 items per call)
  • TransactWriteItems for ACID transactions across up to 25 items
  • DynamoDB Streams (CDC) -- ordered changelog per partition for downstream consumers
  • Global tables -- multi-region active-active replication
Non-Functional
  • Single-digit millisecond p99 latency for reads and writes
  • Scale to ~10M requests/sec globally without pre-provisioning
  • Durability: no data loss on single-AZ failure (3-AZ replication)
  • Auto-scaling: split hot partitions, merge cold ones, zero downtime
  • Adaptive capacity: borrow unused throughput from cold to hot partitions
  • On-demand mode: zero capacity planning, pay per request
03

Scale Estimation

DynamoDB handles trillions of requests per day across all AWS customers. For a single large table, these are representative numbers. The key insight: scale is achieved by adding partitions, not by making individual nodes faster.

Global throughput
~10M req/sec
across all tables, all regions combined
Total data
~100 TB
per table; some tables exceed 1 PB
p99 latency
< 10 ms
single-digit ms for Get/Put; consistent at any scale
Partitions
~1000s
per table; each handles ~1K WCU / 3K RCU
Item size limit
400 KB
per item; larger objects stored in S3 with pointer
Replication factor
3
leader + 2 followers across 3 AZs
04

API Design

DynamoDB's API is intentionally narrow -- no arbitrary SQL, no joins, no aggregations. Every operation targets a specific partition key, which guarantees the router can send the request to exactly one partition in O(1). This constraint is what enables single-digit ms latency at any scale. The exception is Scan, which touches every partition and should never be used on the hot path.

PUTPutItem(TableName, Item{pk, sk, ...attrs})

Insert or overwrite a single item. Supports ConditionExpression for optimistic locking (e.g., attribute_not_exists(pk)). Returns consumed capacity units.

GETGetItem(TableName, Key{pk, sk}, ConsistentRead?)

Fetch a single item by exact primary key. ConsistentRead=true reads from leader; default eventual read from any replica. Sub-ms for cached items.

DELETEDeleteItem(TableName, Key{pk, sk})

Delete a single item. Supports ConditionExpression to prevent deleting if item has changed. Returns old item if ReturnValues=ALL_OLD.

GETQuery(TableName, KeyCondition: pk = X AND sk BETWEEN a AND b)

Fetch multiple items sharing the same partition key, filtered by sort key range. Returns items sorted by sort key. Paginated via LastEvaluatedKey.

POSTBatchWriteItem(RequestItems: {Table: [{PutRequest|DeleteRequest}, ...]})

Bulk write up to 25 items across tables. Not transactional -- partial failures returned in UnprocessedItems. Client retries with exponential backoff.

POSTTransactWriteItems(TransactItems: [{Put|Delete|Update|ConditionCheck}, ...])

ACID transaction across up to 25 items (can span tables). All-or-nothing. Uses optimistic concurrency control -- aborts on conflict. 2x cost of non-transactional writes.

GETScan(TableName, FilterExpression?)

Full table scan with optional filter. Reads every item, then filters. Expensive -- consumes full table RCU. Use for analytics export, not hot path.

05

Architecture

Three layers: request router (stateless fleet, caches partition map), partition map (consistent hash ring mapping key ranges to virtual nodes to physical storage nodes), storage nodes (leader + 2 replicas each running an LSM tree engine). The request router is the only component the client talks to -- it abstracts the entire distributed system behind a simple API.

A separate control plane runs alongside: the auto-admin service monitors throughput and triggers partition split/merge operations; the global table replicator reads from DynamoDB Streams and writes to remote regions; CloudWatch collects per-partition metrics for adaptive capacity decisions.

Key-Value Store ArchitectureSVG
Client (SDK)PutItem / GetItem Request Routerstateless, cached map Partition Mapconsistent hash ringkey -> virtual node Storage Node (L)leader, AZ-1 Replica 1follower, AZ-2 Replica 2follower, AZ-3 LSM Tree (per partition)WAL -> Memtable -> SSTables + Bloom filters Auto-Adminsplit/merge partitions Global Replicatorcross-region sync DynamoDB StreamsCDC changelog CloudWatchthroughput metrics Lambda / Kinesisstream consumers
Request Flow — Step Through
Client SDK · PutItem / GetItemRequest Router · hash + routePartition Map · consistent hash ringLeader Node · WAL + memtableFollower AZ-2 · replicaFollower AZ-3 · replicaLSM Engine · SSTables + compaction
Click Next Step to walk through the request flow.
06

Deep Dive -- Request Routing + LSM Engine + Replication

(a) Request Routing. The client SDK sends a request to a stateless request router (a fleet of lightweight proxies behind a load balancer). The router computes hash(partition_key) -- typically MD5 or a similar uniform hash -- and looks up the consistent hash ring to find the virtual node (vnode) that owns that key range. Each physical storage node owns multiple vnodes (typically 100-200), which ensures even distribution even when node counts change. The router maps the vnode to a physical node's address and forwards the request.

The router caches the partition map in memory for fast lookups. If the map is stale (partition was split, moved, or a new leader was elected), the storage node returns a MOVED-like redirect containing the new owner's address. The router updates its local cache and retries the request transparently -- the client never sees the redirect. This is similar to Redis Cluster's MOVED/ASK protocol.

(b) LSM Storage Engine. Every write goes to two places simultaneously: an append-only Write-Ahead Log (WAL) on disk for durability, and an in-memory sorted buffer called the memtable (typically a red-black tree or skip list). When the memtable reaches a size threshold (~64 MB), it is flushed as an immutable SSTable (Sorted String Table) to disk. Background compaction merges multiple SSTables to reclaim space and reduce read amplification. Each SSTable has a Bloom filter -- a probabilistic data structure that tells you "definitely not here" in O(1), avoiding unnecessary disk reads.

-- Write path (simplified):
1. Append to WAL (sequential disk write, ~0.1 ms)
2. Insert into memtable (in-memory, ~0.01 ms)
3. Ack to client (total: ~0.1 ms)
-- Background:
4. When memtable full -> flush to SSTable on disk
5. Compaction merges SSTables (leveled or size-tiered)
-- Read path (simplified):
1. Check memtable (in-memory, ~0.01 ms)
2. If not found, check each SSTable level:
   a. Query Bloom filter -> "definitely not here" (skip) or "maybe here"
   b. If maybe here, binary search SSTable index -> read data block
3. Return first match (newest version wins)
-- Bloom filter: ~10 bits per key, < 1% false positive rate
-- Typical read touches 1-2 SSTables thanks to Bloom filters

(c) Replication. Each partition has a leader and 2 followers, each in a different Availability Zone. All writes are directed to the leader, which writes locally and then replicates to both followers. A write is acknowledged when 2 of 3 nodes (quorum) confirm -- this ensures durability even if one entire AZ fails. The math: W=2, R=2, N=3. Since W+R > N (4 > 3), any read quorum overlaps with the write quorum, guaranteeing the latest value is always seen by strong reads.

Strong-consistency reads are served by the leader only -- guaranteed to return the most recent write. Eventually-consistent reads (the default, 50% cheaper in RCU) can be served by any replica. Replication lag is typically under 100 ms but can spike during leader elections or network partitions.

(d) Auto-Scaling. The auto-admin service monitors per-partition throughput via CloudWatch metrics. When a partition exceeds its throughput limit, it is split into two partitions (key range bisected). Cold partitions below a threshold are merged. Adaptive capacity goes further: if partition A uses only 200 of its 1000 WCU, and partition B is throttled at 1000 WCU, the system borrows 800 WCU from A and gives it to B -- no user action needed.

(e) DynamoDB Streams (CDC). Every write to the base table produces a time-ordered stream record containing the old and new item images. Stream records are organized by partition and retained for 24 hours. Consumers (Lambda, Kinesis Data Streams) process records in order per partition key. This enables: materialized view updates, cross-region replication for global tables, event-driven architectures, and real-time analytics pipelines -- all without polling the table.

(f) Global Tables. Multi-region active-active replication. Each region has a full replica of the table. Writes in any region are asynchronously replicated to all other regions via DynamoDB Streams, typically within 1 second. Last-writer-wins conflict resolution using timestamps. Enables low-latency reads in any region and disaster recovery with RPO near zero.

PutItem Write PathMermaid
sequenceDiagram participant C as Client participant R as Request Router participant PM as Partition Map participant L as Leader Node participant F1 as Follower AZ-2 participant F2 as Follower AZ-3 C->>R: PutItem(pk, sk, attrs) R->>PM: hash(pk) -> vnode lookup PM-->>R: storage_node_leader R->>L: forward write L->>L: append WAL + insert memtable par replicate L->>F1: replicate write L->>F2: replicate write end F1-->>L: ack L-->>R: quorum ack (2/3) R-->>C: 200 OK (1.2 ms)
-- Consistent hash ring example:
-- 256 virtual nodes on a ring [0, 2^128)
-- hash("user_123") = 0x4A3F... -> falls between vnode 47 and vnode 48
-- vnode 48 owned by storage_node_7 (leader)
-- Router forwards PutItem to storage_node_7
-- If node_7 was split, it returns MOVED -> storage_node_12
-- Router updates cache, retries to node_12 transparently
-- Bloom filter per SSTable:
-- Insert: hash item key with k hash functions, set k bits
-- Lookup: check k bits. If ANY bit = 0 -> key NOT in this SSTable (skip)
-- If all bits = 1 -> key MIGHT be here (read the SSTable)
-- False positive rate ~1% with 10 bits/key, 7 hash functions
-- Saves ~99% of unnecessary SSTable reads
Interview answer

"Client calls PutItem. The request router hashes the partition key, consults a cached consistent hash ring to find the leader storage node. The leader appends to WAL and inserts into an in-memory memtable -- both under 0.1 ms. It replicates to 2 followers across AZs. Once 2 of 3 ack (quorum), the write is durable and the client gets a response in single-digit milliseconds. Background compaction flushes memtables to SSTables and merges them. Reads check the memtable first, then SSTables with Bloom filters to skip irrelevant files. Hot partitions are auto-split; cold ones merged. Adaptive capacity borrows unused throughput across partitions."

07

Anti-patterns

Most DynamoDB performance problems come from three mistakes: bad partition key design, using the wrong access pattern, or treating it like a relational database.

X
Scan entire table for analytics queries

Scan reads every item, consumes full table RCU, and takes minutes on large tables. You are paying for a key-value store to do a full table scan. Production traffic gets throttled while the scan runs.

Better: Enable DynamoDB Streams and pipe to Kinesis/Lambda for real-time analytics. Use the native export-to-S3 feature for batch analytics -- it reads from snapshots, not live partitions. Query the S3 export with Athena or Spark.
X
Using a date or timestamp as partition key

All writes for "today" hit the same partition. You get a hot partition that throttles while thousands of other partitions sit idle. Classic hotspot pattern that adaptive capacity can only partially mitigate.

Better: Use a high-cardinality partition key (user_id, device_id, order_id). If date is needed, add a random suffix (write sharding): 2024-01-15#03 spreads writes across N shards. Fan out on write, scatter-gather on read.
X
Storing items larger than 400 KB inline

DynamoDB has a hard 400 KB item size limit. Even below that, large items waste RCU/WCU -- billing is per 4 KB for reads and 1 KB for writes. A 400 KB read costs 100 RCU vs 1 RCU for a 4 KB item. It also bloats replication traffic across AZs.

Better: Store the large payload (images, documents, JSON blobs) in S3. Keep a small DynamoDB item with metadata and an S3 URI pointer. Read metadata from DynamoDB (1 RCU), fetch the blob from S3 separately.
08

Tradeoffs & Design Choices

Every design decision in a key-value store is a tradeoff between latency, throughput, cost, and consistency. Here are the key choices and when each side wins.

  • Provisioned vs on-demand capacity. Provisioned: cheaper at steady load, requires capacity planning. On-demand: 6x more expensive per request, but zero planning -- scales instantly. Use provisioned for predictable workloads, on-demand for spiky or new tables.
  • Strong vs eventual consistency reads. Strong reads always go to the leader -- guaranteed latest, but higher latency and 2x RCU cost. Eventual reads can hit any replica -- lower latency, half the cost, but might return stale data (typically < 1 second behind). Default to eventual; use strong only when correctness requires it.
  • Single-table vs multi-table design. Single-table: all entity types in one table using overloaded sort keys (e.g., pk=USER#123, sk=ORDER#456). Enables efficient queries across entity types in one call. Multi-table: simpler schema, easier to reason about, but requires joins at the application layer. Single-table wins for read-heavy access patterns; multi-table wins for simplicity.
  • LSM tree (write-optimized) vs B-tree (read-optimized). LSM: writes are sequential (append WAL + memtable), ~10x faster writes, but reads may check multiple SSTables (read amplification). B-tree: reads are O(log N) from a single structure, but writes require random I/O to update in-place. DynamoDB chose LSM because key-value workloads are often write-heavy and sequential I/O is dramatically faster on SSDs.
  • Partition key design: composite vs simple. Simple PK (just partition key): each item is unique by PK alone. Composite PK (partition key + sort key): enables Query operations over a range of sort keys within one partition. Use composite when you need to model 1-to-many relationships (e.g., user orders: pk=USER#123, sk=ORDER#timestamp).
09

Failure Modes

DynamoDB is designed to degrade gracefully -- throttle rather than crash, redirect rather than error. But every distributed system has failure modes that the client must handle.

T
Hot partition throttling
One partition key receives disproportionate traffic (e.g., viral celebrity profile). Partition exceeds its throughput allocation and requests get throttled with ProvisionedThroughputExceededException.
Mitigation: adaptive capacity borrows from cold partitions. Write sharding for known hot keys. On-demand mode auto-scales per partition. SDK retries with exponential backoff.
L
Leader failure during write
Leader node crashes after replicating to one follower but before responding to the client. Client sees a timeout.
Mitigation: quorum was achieved (2/3 including the follower), so data is durable. A new leader is elected from followers. Client retries idempotently (condition expressions prevent duplicates). No data loss.
G
GSI propagation lag
Global Secondary Indexes are updated asynchronously. Under load, GSI can be seconds behind the base table. Query on GSI returns stale results.
Mitigation: GSI lag is inherent (async by design). For strong consistency, query the base table. Monitor GSI replication lag via CloudWatch. Over-provision GSI write capacity to keep up.
S
Partition split during peak traffic
Auto-admin decides to split a hot partition while it is under heavy load. Brief period of elevated latency and possible throttling during the split window as the partition map is updated and data is redistributed.
Mitigation: split is primarily a metadata operation (update partition map, create new partition with copied key range). Reads/writes continue during split with brief redirects via MOVED responses. For known traffic spikes (product launches, Black Friday), pre-split partitions in advance using provisioned capacity increases.
R
Region-level outage with global tables
An entire AWS region goes offline. Global table replicas in that region become unreachable. Writes that were in-flight to the failed region may not have replicated to other regions yet (async replication).
Mitigation: Route53 health checks detect regional failure and redirect traffic to healthy regions. Global table replicas in other regions continue serving reads and writes independently. RPO is typically < 1 second for async replication. On region recovery, DynamoDB Streams automatically catches up any missed writes.
10

Interview Tips

In a system design interview, the key-value store question tests your understanding of distributed systems fundamentals. Focus on these points to demonstrate depth.

  1. Start with consistent hashing. "hash(partition_key) maps to a virtual node on a consistent hash ring. Adding/removing nodes only remaps 1/N of keys." This is the foundation.
  2. Name the LSM tree explicitly. "Writes append to WAL + memtable, flush to SSTables, background compaction merges them. Bloom filters avoid unnecessary reads." Shows you understand the storage engine, not just the API.
  3. Quorum math: W + R > N. "3 replicas, write quorum of 2, read quorum of 2. W+R=4 > 3, so any read sees the latest write. For eventual consistency, R=1 is enough."
  4. Partition key design is the interview. High cardinality, uniform distribution, matches the access pattern. Bad PK = hot partition = system down. Give a concrete example: user_id good, date bad.
  5. Mention adaptive capacity. This is the detail that shows you have gone beyond textbook knowledge. "DynamoDB borrows unused capacity from cold partitions to hot ones -- no manual intervention."
  6. Contrast with the original Dynamo paper. "Dynamo used vector clocks and sloppy quorums for AP. DynamoDB uses a single leader per partition for CP within a region, which is a simpler model. Global tables add eventual consistency across regions."
  7. Discuss single-table design. "Overload the sort key to store multiple entity types in one table. PK=USER#123 SK=PROFILE for metadata, SK=ORDER#2024-01-15 for orders. One Query returns all data for a user. This is the DynamoDB-native pattern."
12

Evolution

The evolution from a single-node hash table to a globally distributed managed service mirrors the history of distributed systems research -- each step added a capability and introduced new complexity.

1

Single-node hash table

In-memory hash map on one server. O(1) lookups, fast but limited by RAM (typically < 64 GB). No durability -- data lost on process crash. No replication -- single point of failure. Sufficient for caching, but not for a database. This is where memcached lives.

2

Dynamo paper (2007) -- consistent hashing + vector clocks

Amazon's internal Dynamo system. Consistent hashing with virtual nodes distributes keys across a ring of physical nodes. Vector clocks track causality and detect conflicts from concurrent writes. Sloppy quorum + hinted handoff prioritize availability over consistency (AP in CAP). Merkle trees detect replica divergence for anti-entropy repair.

3

DynamoDB -- managed, auto-partitioned

Fully managed service launched in 2012. Automatic partition splitting based on size (10 GB) and throughput (1K WCU / 3K RCU). Dropped vector clocks -- single leader per partition with Paxos-based leader election simplifies consistency model. Provisioned capacity with manual scaling.

4

Adaptive capacity + on-demand mode (2018)

Adaptive capacity intelligently redistributes unused throughput from cold partitions to hot ones in real time. On-demand mode (pay-per-request) eliminates capacity planning entirely -- table scales from zero to millions of requests per second instantly. Auto-scaling with target tracking for provisioned mode.

5

Global tables + PITR + export-to-S3

Multi-region active-active replication via global tables with last-writer-wins conflict resolution. Point-in-time recovery (continuous backups, restore to any second in the last 35 days). Native export to S3 for analytics -- reads from snapshots, zero impact on live table throughput. PartiQL support for SQL-like queries.

Next up