(a) Request Routing. The client SDK sends a request to a stateless request router (a fleet of lightweight proxies behind a load balancer). The router computes hash(partition_key) -- typically MD5 or a similar uniform hash -- and looks up the consistent hash ring to find the virtual node (vnode) that owns that key range. Each physical storage node owns multiple vnodes (typically 100-200), which ensures even distribution even when node counts change. The router maps the vnode to a physical node's address and forwards the request.
The router caches the partition map in memory for fast lookups. If the map is stale (partition was split, moved, or a new leader was elected), the storage node returns a MOVED-like redirect containing the new owner's address. The router updates its local cache and retries the request transparently -- the client never sees the redirect. This is similar to Redis Cluster's MOVED/ASK protocol.
(b) LSM Storage Engine. Every write goes to two places simultaneously: an append-only Write-Ahead Log (WAL) on disk for durability, and an in-memory sorted buffer called the memtable (typically a red-black tree or skip list). When the memtable reaches a size threshold (~64 MB), it is flushed as an immutable SSTable (Sorted String Table) to disk. Background compaction merges multiple SSTables to reclaim space and reduce read amplification. Each SSTable has a Bloom filter -- a probabilistic data structure that tells you "definitely not here" in O(1), avoiding unnecessary disk reads.
-- Write path (simplified):
1. Append to WAL (sequential disk write, ~0.1 ms)
2. Insert into memtable (in-memory, ~0.01 ms)
3. Ack to client (total: ~0.1 ms)
-- Background:
4. When memtable full -> flush to SSTable on disk
5. Compaction merges SSTables (leveled or size-tiered)
-- Read path (simplified):
1. Check memtable (in-memory, ~0.01 ms)
2. If not found, check each SSTable level:
a. Query Bloom filter -> "definitely not here" (skip) or "maybe here"
b. If maybe here, binary search SSTable index -> read data block
3. Return first match (newest version wins)
-- Bloom filter: ~10 bits per key, < 1% false positive rate
-- Typical read touches 1-2 SSTables thanks to Bloom filters
(c) Replication. Each partition has a leader and 2 followers, each in a different Availability Zone. All writes are directed to the leader, which writes locally and then replicates to both followers. A write is acknowledged when 2 of 3 nodes (quorum) confirm -- this ensures durability even if one entire AZ fails. The math: W=2, R=2, N=3. Since W+R > N (4 > 3), any read quorum overlaps with the write quorum, guaranteeing the latest value is always seen by strong reads.
Strong-consistency reads are served by the leader only -- guaranteed to return the most recent write. Eventually-consistent reads (the default, 50% cheaper in RCU) can be served by any replica. Replication lag is typically under 100 ms but can spike during leader elections or network partitions.
(d) Auto-Scaling. The auto-admin service monitors per-partition throughput via CloudWatch metrics. When a partition exceeds its throughput limit, it is split into two partitions (key range bisected). Cold partitions below a threshold are merged. Adaptive capacity goes further: if partition A uses only 200 of its 1000 WCU, and partition B is throttled at 1000 WCU, the system borrows 800 WCU from A and gives it to B -- no user action needed.
(e) DynamoDB Streams (CDC). Every write to the base table produces a time-ordered stream record containing the old and new item images. Stream records are organized by partition and retained for 24 hours. Consumers (Lambda, Kinesis Data Streams) process records in order per partition key. This enables: materialized view updates, cross-region replication for global tables, event-driven architectures, and real-time analytics pipelines -- all without polling the table.
(f) Global Tables. Multi-region active-active replication. Each region has a full replica of the table. Writes in any region are asynchronously replicated to all other regions via DynamoDB Streams, typically within 1 second. Last-writer-wins conflict resolution using timestamps. Enables low-latency reads in any region and disaster recovery with RPO near zero.
PutItem Write PathMermaid
sequenceDiagram
participant C as Client
participant R as Request Router
participant PM as Partition Map
participant L as Leader Node
participant F1 as Follower AZ-2
participant F2 as Follower AZ-3
C->>R: PutItem(pk, sk, attrs)
R->>PM: hash(pk) -> vnode lookup
PM-->>R: storage_node_leader
R->>L: forward write
L->>L: append WAL + insert memtable
par replicate
L->>F1: replicate write
L->>F2: replicate write
end
F1-->>L: ack
L-->>R: quorum ack (2/3)
R-->>C: 200 OK (1.2 ms)
-- Consistent hash ring example:
-- 256 virtual nodes on a ring [0, 2^128)
-- hash("user_123") = 0x4A3F... -> falls between vnode 47 and vnode 48
-- vnode 48 owned by storage_node_7 (leader)
-- Router forwards PutItem to storage_node_7
-- If node_7 was split, it returns MOVED -> storage_node_12
-- Router updates cache, retries to node_12 transparently
-- Bloom filter per SSTable:
-- Insert: hash item key with k hash functions, set k bits
-- Lookup: check k bits. If ANY bit = 0 -> key NOT in this SSTable (skip)
-- If all bits = 1 -> key MIGHT be here (read the SSTable)
-- False positive rate ~1% with 10 bits/key, 7 hash functions
-- Saves ~99% of unnecessary SSTable reads
Interview answer
"Client calls PutItem. The request router hashes the partition key,
consults a cached consistent hash ring to find the leader storage node.
The leader appends to WAL and inserts into an in-memory memtable --
both under 0.1 ms. It replicates to 2 followers across AZs. Once 2 of 3
ack (quorum), the write is durable and the client gets a response in
single-digit milliseconds. Background compaction flushes memtables to
SSTables and merges them. Reads check the memtable first, then SSTables
with Bloom filters to skip irrelevant files. Hot partitions are auto-split;
cold ones merged. Adaptive capacity borrows unused throughput across
partitions."