Amazon S3

A bucket-and-key object store with strong read-after-write consistency, ~11 nines of durability, and exabyte-scale capacity. The hard parts: a key → bytes service that doesn't fall over at millions of requests per second; erasure-coded durability that is cheaper than 3× replication yet rebuilds gracefully; and a hot-partition story for what happens when everyone writes to keys starting with 2026-04-12/. S3 stores hundreds of trillions of objects and serves on the order of 100 million requests per second at peak.

⚡ Core: Durability + Hot-Partition Handling · 11 nines durability · 100T+ objects · ~100M req/sec peak · Exabyte storage

Requirements

Functional

PUT / GET / DELETE on (bucket, key) tuples; opaque byte contents
Object size from 0 bytes to 5 TB; multipart upload for large ones
Bucket ACLs + per-object ACLs; IAM integration; pre-signed URLs
Versioning per bucket (opt-in): preserve old object versions on overwrite
List objects by prefix (for file-system-like browse)
Lifecycle: transition to Glacier after N days; auto-delete after M
Events: emit notifications (SNS/SQS/Lambda) on put/delete

Non-Functional

11 nines durability (99.999999999%): on average, one object in 100 billion lost per year
Strong read-after-write consistency for PUT, DELETE, and LIST (since Dec 2020)
Scale to ~3,500 PUT / 5,500 GET per second per partitioned prefix; aggregate throughput grows horizontally as prefixes split
Available across an AWS region with AZ redundancy (3 AZs minimum)
Integrity verified via MD5/SHA on every byte
Pay-for-what-you-store: cold tiers 10× cheaper than hot

Scale Estimation

Objects stored

~280T

AWS disclosed: 280 trillion objects as of 2023; growing double-digit % YoY

Request peak

~100M req/sec

aggregate across all customers; individual buckets throttle at ~3,500 PUT/sec (5,500 GET/sec) per prefix

Durability target

11 nines

achieved via Reed-Solomon erasure coding + cross-AZ placement

Storage overhead

~1.4×

RS(10,4) stores 14 shards for every 10 data shards — cheaper than 3× replication

Partition key space

1024

bucket split into ~1024 index partitions; auto-splits on load

Disk rebuild time

~hours

failed disk rebuild = read 10 surviving shards per stripe, Reed-Solomon decode to recover the lost shard (Galois-field arithmetic, not a plain XOR); bandwidth-bounded

API Design

PUT /{bucket}/{key}

Upload an object. Body = bytes. Headers: Content-MD5 (base64-encoded MD5 of the body, for verification), x-amz-storage-class (STANDARD / STANDARD_IA / GLACIER). Returns ETag (the MD5 of the object for a simple, non-multipart PUT). Idempotent on identical bytes.
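As a minimal sketch of the integrity headers on a simple PUT: Content-MD5 carries the base64 form of the raw MD5 digest, while the ETag of a single-part, unencrypted upload is typically the same digest in hex. The function names below are illustrative, not part of any SDK.

```python
import base64
import hashlib


def content_md5_header(body: bytes) -> str:
    """Base64 of the raw 16-byte MD5 digest, the encoding Content-MD5 uses."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def simple_put_etag(body: bytes) -> str:
    """Hex MD5 of the body; assumed to match the ETag only for a plain,
    single-part PUT (multipart ETags are computed differently)."""
    return hashlib.md5(body).hexdigest()


body = b"hello"
headers = {"Content-MD5": content_md5_header(body)}
print(headers, simple_put_etag(body))
```

A client that compares its locally computed value against the returned ETag gets an end-to-end check without a second round trip.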

GET /{bucket}/{key}

Fetch object. Supports Range headers for partial reads (Range: bytes=0-1048575). Returns object bytes + metadata headers (x-amz-meta-*).
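Range reads are what make parallel downloads possible. The sketch below splits an object of known size into Range header values; `range_headers` and its parameters are illustrative names, and the only protocol fact relied on is that byte ranges are inclusive on both ends.

```python
def range_headers(object_size: int, chunk_size: int) -> list[str]:
    """Return one 'bytes=start-end' value per chunk, covering the whole object."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size > 0")
    headers = []
    for start in range(0, object_size, chunk_size):
        # Range ends are inclusive, hence the -1.
        end = min(start + chunk_size, object_size) - 1
        headers.append(f"bytes={start}-{end}")
    return headers


print(range_headers(2_500_000, 1_048_576))
# ['bytes=0-1048575', 'bytes=1048576-2097151', 'bytes=2097152-2499999']
```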

POST /{bucket}/{key}?uploads

Initiate multipart upload. Returns UploadId. For objects > 100 MB or resumable uploads. Client then uploads parts in parallel.

PUT /{bucket}/{key}?partNumber=N&uploadId=ID

Upload part (minimum 5 MB except last). Returns ETag for that part. Parts can be uploaded in parallel; max 10,000 parts per object.
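The part limits interact: a large object with small parts would exceed the 10,000-part cap. A rough sketch of how a client might pick a part size follows. The 5 MiB minimum and 10,000-part cap come from the text above; the 5 GiB maximum part size and the 64 MiB default are assumptions of this sketch, and `plan_parts` is a hypothetical helper.

```python
MIN_PART = 5 * 1024**2   # 5 MiB minimum (the last part is exempt)
MAX_PART = 5 * 1024**3   # assumed 5 GiB maximum per part
MAX_PARTS = 10_000


def plan_parts(object_size: int, preferred_part: int = 64 * 1024**2) -> tuple[int, int]:
    """Return (part_size, part_count) that respects the multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that keeps the upload within MAX_PARTS (integer ceil).
    floor_for_cap = -(-object_size // MAX_PARTS)
    part_size = max(MIN_PART, preferred_part, floor_for_cap)
    if part_size > MAX_PART:
        raise ValueError("object too large for the multipart limits")
    part_count = -(-object_size // part_size)
    return part_size, part_count


print(plan_parts(100 * 1024**2))  # (67108864, 2)
print(plan_parts(5 * 1024**4))    # part size grows so the count stays at 10,000
```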

POST /{bucket}/{key}?uploadId=ID

Complete multipart. Body = ordered list of (partNumber, ETag). Server validates all parts exist; stitches into single object; returns final object ETag.

GET /{bucket}?prefix=folder/&delimiter=/&max-keys=1000

List objects. Cursor-based pagination. Heavy use case — dominates hot-partition problems.
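To make the prefix, delimiter, and cursor semantics concrete, here is an in-memory simulation over a flat set of keys. It is a sketch of the observable behaviour only (keys under the prefix, with anything past the next delimiter rolled up into a common prefix), not of how the index implements it; `list_page` and its return shape are illustrative.

```python
def list_page(keys, prefix="", delimiter="", max_keys=1000, cursor=""):
    """Return (contents, common_prefixes, next_cursor) for one page."""
    entries = {}  # entry string -> True if it is a rolled-up common prefix
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut >= 0:
            # Everything up to and including the delimiter becomes one "folder".
            entries[prefix + rest[:cut + len(delimiter)]] = True
        else:
            entries[key] = False
    ordered = sorted(e for e in entries if e > cursor)
    page = ordered[:max_keys]
    next_cursor = page[-1] if len(ordered) > max_keys else None
    contents = [e for e in page if not entries[e]]
    common_prefixes = [e for e in page if entries[e]]
    return contents, common_prefixes, next_cursor


keys = ["folder/a.txt", "folder/b.txt", "folder/sub/c.txt",
        "folder/sub/d.txt", "other/e.txt"]
print(list_page(keys, prefix="folder/", delimiter="/"))
# (['folder/a.txt', 'folder/b.txt'], ['folder/sub/'], None)
```

Passing the returned cursor back in yields the next page; a cursor of None means the listing is complete.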

DELETE /{bucket}/{key}

Delete object. With versioning enabled, creates a delete marker; actual bytes retained until version is explicitly deleted.

Architecture

Two distinct planes. The index plane (metadata: (bucket, key) → physical locations) is a sharded, highly-available KV service. The data plane (actual bytes, erasure-coded across disks and AZs) is the storage fleet. A request front-end authenticates + routes; a background garbage collector reclaims deleted bytes after safe intervals.

S3 Request Flow + Storage Layout SVG

Request Flow — Step Through

Client · PUT /bucket/key → Front-end · auth + route → Index alloc · object_id + partition → RS encoder · 10 data + 4 parity → AZ placement · cross 3+ AZs → Durability ack · all 14 shards → Commit + ETag · metadata committed

Deep Dive — Durability via Erasure Coding + Hot Partitions

S3 promises "eleven nines" of durability. That's not a marketing line; it's a budget. 99.999999999% means an expected annual loss rate of 1 object in 100 billion: store 100B objects and you expect to lose ~1 per year; store 10B and it is ~1 per decade. Achieving this cheaply is the central engineering challenge.
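The budget arithmetic is easy to check. The sketch below treats eleven nines as an annual per-object loss probability of 1 in 10^11 and uses exact fractions to avoid floating-point noise; the function names are illustrative.

```python
from fractions import Fraction

# 1 - 0.99999999999 = 1e-11 annual loss probability per object.
LOSS_PER_OBJECT_YEAR = Fraction(1, 10**11)


def expected_losses_per_year(object_count: int) -> Fraction:
    return object_count * LOSS_PER_OBJECT_YEAR


def years_per_expected_loss(object_count: int) -> Fraction:
    return 1 / expected_losses_per_year(object_count)


print(years_per_expected_loss(10**7))    # 10000 (10M objects: one loss per 10,000 years)
print(years_per_expected_loss(10**10))   # 10    (10B objects: one loss per decade)
print(expected_losses_per_year(10**11))  # 1     (100B objects: one loss per year)
```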

Triple replication is too expensive. Storing 3 copies gives ~6 nines and costs 3× storage. For exabyte scale, that's billions of dollars of overhead.

Reed-Solomon erasure coding. Split each object into 10 data shards; compute 4 parity shards. Any 10 of 14 can reconstruct the whole object, so up to 4 shards can be lost safely; it takes 5 simultaneous losses within the same stripe (astronomically unlikely if the shards sit on independent disks spread across AZs) to lose data. Storage overhead: 1.4× instead of 3×. Durability: ~11 nines when placed across multiple AZs.
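The storage-versus-durability comparison can be sketched with a binomial tail. The model assumes each shard is lost independently with probability p before repair completes; p is a made-up illustrative input, not a measured S3 figure, and real failures are partly correlated. Under this model, 3× replication is simply the k=1, m=2 case.

```python
from math import comb


def storage_overhead(k: int, m: int) -> float:
    """Bytes stored per byte of user data for k data + m parity shards."""
    return (k + m) / k


def stripe_loss_probability(k: int, m: int, p: float) -> float:
    """P(more than m of the k+m shards are lost), assuming independent losses."""
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m + 1, n + 1))


p = 0.01  # hypothetical per-shard loss probability within one repair window
print(storage_overhead(10, 4), stripe_loss_probability(10, 4, p))  # RS(10,4)
print(storage_overhead(1, 2), stripe_loss_probability(1, 2, p))    # 3x replication
```

With these assumptions RS(10,4) comes out both cheaper and less likely to lose a stripe than three full copies, which is the tradeoff the paragraph above describes.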

Write Path — Erasure-coded placement Mermaid

sequenceDiagram
    participant C as Client
    participant FE as Front-end
    participant IX as Index svc
    participant ENC as RS Encoder
    participant S as Storage fleet
    C->>FE: PUT /bucket/key (bytes)
    FE->>IX: allocate object_id
    IX-->>FE: object_id + partition hint
    FE->>ENC: split + encode bytes
    ENC->>ENC: produce 10 data + 4 parity shards
    par write across AZs
        ENC->>S: shard 1→10 (AZ a/b/c/d)
        ENC->>S: parity 1→4
    end
    S-->>FE: all 14 acked
    FE->>IX: commit {object_id, shard_locations, etag}
    IX-->>FE: committed
    FE-->>C: 200 OK + ETag
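The split-and-encode step can be illustrated with a deliberately simplified stand-in. The sketch below uses a single XOR parity shard, which tolerates only one lost shard; real Reed-Solomon (10, 4) uses Galois-field arithmetic to produce four parity shards and is not shown here. The function names are illustrative.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list:
    """Split into k equal data shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # integer ceil
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def recover(shards: list) -> list:
    """Rebuild at most one missing shard (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can repair only one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


stripe = encode(b"hello world!", k=4)
damaged: list = list(stripe)
damaged[2] = None  # simulate a failed disk
print(b"".join(recover(damaged)[:4]))  # b'hello world!'
```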

Hot partition problem. The metadata index is sharded. Historically the index kept keys in lexicographic order across partitions, so keys sharing a prefix sat together: good for prefix listing, bad for load spreading. A naming pattern like 2026/04/12/log-XXX.json meant all of today's writes hit the same few metadata partitions. Throughput caps at ~3,500 PUT/sec (5,500 GET/sec) per prefix; worst-case customers saw throttling.

S3 solved this in two steps:

Hot partition auto-split. Metadata service detects rising QPS on a partition and automatically splits it into two, rebalancing keys. Background process; live traffic largely unaffected.
Auto-partitioning (2018 launch). The service learned to auto-shard by key prefix entropy. Today's customers don't need to add random hash prefixes to their key names — S3 handles it. But for extreme rates, the old advice still helps: ${random_hex}/2026/04/12/log.json spreads load evenly.
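The hash-prefix workaround can be sketched as below. Instead of a random hex string, this version derives the prefix from a hash of the key itself, so a reader can recompute the full key without a lookup table; that substitution is a choice of this sketch, and `spread_key` is a hypothetical helper.

```python
import hashlib


def spread_key(key: str, prefix_hex_chars: int = 4) -> str:
    """Prepend a short, deterministic hex prefix derived from the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_hex_chars]}/{key}"


for i in range(3):
    print(spread_key(f"2026/04/12/log-{i:03d}.json"))
```

The cost is that date-ordered listing no longer works with a single prefix, so this trades list convenience for write spread.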

Rebuild storm. When a disk fails, the system reads 10 of the 13 surviving shards of each affected stripe to rebuild the lost shard onto a new disk. That is 10× read amplification: every byte rebuilt costs 10 bytes of reads from the surviving disks. Mitigations: prioritize low-redundancy objects, throttle rebuild bandwidth to prevent starving live traffic, and keep spare capacity so even a major failure doesn't push the system into the red zone.

Multipart upload for big objects. A 5 TB object can't reasonably be one PUT. Multipart lets you:

Upload parts in parallel (higher throughput).
Resume on failure (only re-upload the failed part).
Abort to reclaim space (lifecycle policies clean stale uploads).

Each part is stored as its own erasure-coded blob; the final CompleteMultipartUpload just writes a manifest. No byte-level stitching required.
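One way to picture the manifest: an ordered list of parts with cumulative offsets, so a ranged read can be mapped onto the part blobs that hold those bytes. The structure and names below (`Part`, `blob_id`, `resolve_range`) are hypothetical illustrations, not S3 internals.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    part_number: int
    blob_id: str  # hypothetical handle to the part's erasure-coded blob
    size: int


def build_manifest(parts):
    """Order parts and record each part's starting offset in the final object."""
    ordered = sorted(parts, key=lambda p: p.part_number)
    offsets, total = [], 0
    for part in ordered:
        offsets.append(total)
        total += part.size
    return ordered, offsets, total


def resolve_range(manifest, start: int, end: int):
    """Map an inclusive byte range to (blob_id, first_byte, last_byte) reads."""
    ordered, offsets, total = manifest
    if not 0 <= start <= end < total:
        raise ValueError("range outside object")
    reads = []
    i = bisect_right(offsets, start) - 1  # part containing the first byte
    while i < len(ordered) and offsets[i] <= end:
        lo = max(start, offsets[i]) - offsets[i]
        hi = min(end, offsets[i] + ordered[i].size - 1) - offsets[i]
        reads.append((ordered[i].blob_id, lo, hi))
        i += 1
    return reads


manifest = build_manifest([Part(2, "b", 10), Part(1, "a", 10), Part(3, "c", 5)])
print(resolve_range(manifest, 8, 12))  # [('a', 8, 9), ('b', 0, 2)]
```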

Interview answer

"Two planes: index (metadata) and data (bytes). Data plane uses Reed-Solomon (10, 4) erasure coding spread across 3+ AZs: 14 shards, any 10 rebuild the object, 1.4× storage overhead, 11 nines durability. Index plane is sharded by (bucket, key) and auto-splits hot partitions. Front-end fleet authenticates + routes. Multipart upload splits big objects into parallel, resumable parts. Background control plane handles GC, lifecycle tiering, event emission, and cross-region replication. The hard parts are durability math and hot-partition auto-handling, not the API surface."

Tradeoffs & Design Choices

Erasure coding vs replication. EC saves storage but costs more CPU (encode/decode on every read/write) and higher recovery bandwidth. Worth it only at large scale. Small-scale systems stick with replication.
Strong vs eventual consistency. S3 became strongly consistent in December 2020. Before: a GET right after creating a new object generally worked, but overwrites, deletes, and list results were only eventually consistent, so a read could return stale data and a list could omit a fresh object. Now: GET, PUT, DELETE, and LIST are all strongly consistent (read-after-write). Users pay nothing extra; AWS absorbed the engineering cost.
Storage tiering. Hot (SSD), warm (HDD), cold (Glacier = tape / slow disk). 10× cost difference between tiers. Lifecycle policies auto-migrate; reads from cold tiers take minutes-to-hours. Important to expose as an explicit product rather than hiding.
Per-object metadata overhead. Every object has a few KB of metadata (ACL, storage class, lifecycle markers). At 280T objects, even 1 KB each is ~280 PB of metadata, and a few KB each approaches an exabyte. Not ignorable.
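List consistency. Before December 2020, list operations scanned metadata partitions and could miss just-written objects for seconds, so users who needed "list includes everything I wrote" either waited a few seconds or called GetObject directly. Since the 2020 change, LIST reflects completed writes immediately and the workaround is no longer needed.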

Failure Modes

💾

Concurrent disk failures during rebuild

One disk fails; during rebuild a 2nd in the same stripe fails; then a 3rd. With RS(10,4) we can tolerate 4 losses; beyond that → data loss.

→ Mitigation: (a) spread shards across AZs so correlated power/network failures don't take out multiple shards; (b) prioritize rebuilds for stripes with fewer surviving shards; (c) cap disk fleet utilization so rebuild has headroom.
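Mitigation (b) amounts to a priority queue keyed on remaining redundancy. A minimal sketch, assuming the repair service knows how many shards of each stripe survive; `rebuild_order` and the stripe ids are hypothetical.

```python
import heapq


def rebuild_order(surviving: dict[str, int], k: int = 10, n: int = 14) -> list[str]:
    """Return degraded stripe ids, fewest spare shards first.

    Stripes with all n shards need no work; stripes with fewer than k
    surviving shards cannot be rebuilt and are skipped here.
    """
    heap = []
    for stripe_id, count in surviving.items():
        if k <= count < n:
            # Margin = how many more losses the stripe can absorb.
            heapq.heappush(heap, (count - k, stripe_id))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


print(rebuild_order({"s1": 13, "s2": 11, "s3": 14, "s4": 10, "s5": 9}))
# ['s4', 's2', 's1']
```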

🔥

Hot prefix throttling

Customer logs keyed 2026/04/12/evt-000001.json all hit the same metadata partition; the ~3,500 PUT/sec ceiling is hit; customer sees 503s.

→ Mitigation: auto-split triggered by sustained high QPS; split takes a few minutes. Customer guidance: prefix with a hash ff23/2026/04/12/evt-... for immediate workaround.
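A toy model of the auto-split: a range-partitioned index that splits any partition whose recent write count exceeds a limit, cutting at the median key it saw. The class, its `qps_limit`, and the median-split policy are assumptions made for illustration; the real service's policy is not described in the text.

```python
from bisect import bisect_right, insort


class RangePartitionedIndex:
    """Sorted split points divide the key space into contiguous ranges."""

    def __init__(self, qps_limit: int):
        self.qps_limit = qps_limit
        self.split_points: list[str] = []

    def partition_for(self, key: str) -> int:
        # Keys equal to a split point belong to the partition on its right.
        return bisect_right(self.split_points, key)

    def split_hot(self, keys_last_second: list[str]) -> list[str]:
        """Split every over-limit partition at its median key; return new split points."""
        by_partition: dict[int, list[str]] = {}
        for key in keys_last_second:
            by_partition.setdefault(self.partition_for(key), []).append(key)
        new_points = []
        for keys in by_partition.values():
            if len(keys) <= self.qps_limit:
                continue
            keys.sort()
            median = keys[len(keys) // 2]
            if median != keys[0]:  # a split only helps if it separates keys
                new_points.append(median)
        for point in new_points:
            insort(self.split_points, point)
        return new_points


index = RangePartitionedIndex(qps_limit=4)
writes = [f"2026/04/12/evt-{i:06d}.json" for i in range(10)]
print(index.split_hot(writes))  # ['2026/04/12/evt-000005.json']
```

Note that strictly increasing keys keep landing in the rightmost range, which is one reason the hash-prefix advice still helps at extreme rates.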

🕯️

Silent data corruption (bit rot)

Disk returns wrong bytes, hash mismatches. If undetected, corruption propagates through rebuilds.

→ Mitigation: continuous background scrubber reads every shard, verifies hash, repairs from parity. Every object has end-to-end MD5/SHA that client + server verify on write and read.
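The scrubber's core check is small: re-hash each shard and compare against the digest recorded at write time. A sketch assuming SHA-256 digests kept in a separate catalog; `scrub` and the in-memory dictionaries stand in for disk reads and the real metadata store.

```python
import hashlib


def scrub(shards: dict[str, bytes], expected: dict[str, str]) -> list[str]:
    """Return ids of shards whose SHA-256 no longer matches the recorded digest."""
    return sorted(
        shard_id
        for shard_id, data in shards.items()
        if hashlib.sha256(data).hexdigest() != expected.get(shard_id)
    )


shards = {"s1": b"alpha", "s2": b"beta"}
catalog = {sid: hashlib.sha256(data).hexdigest() for sid, data in shards.items()}
shards["s2"] = b"bet\x00"      # simulate bit rot on one shard
print(scrub(shards, catalog))  # ['s2']
```

Shards flagged this way would be queued for repair from the remaining shards of the stripe before another failure compounds the damage.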

🚨

Accidental public-bucket disaster

Customer misconfigures bucket policy; data leaks. This has caused major breaches (Verizon, Dow Jones, countless others).

→ Mitigation: "Block Public Access" settings at account and bucket level (introduced 2018, enabled by default on new buckets since 2023); UI flags public buckets prominently; new buckets private-by-default. It's a product + UX mitigation, not just engineering.

🔁

Orphaned multipart uploads

Customer initiates multipart upload, uploads 50 parts, never calls Complete or Abort. Parts sit in storage indefinitely, billing the customer.

→ Mitigation: lifecycle rule auto-abort incomplete multipart uploads after N days. Bucket storage analytics surfaces orphaned uploads to customers.
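The auto-abort rule is a small piece of lifecycle configuration. The sketch below only builds the rule as a dictionary; the field names follow the S3 lifecycle configuration shape as I understand it (Rules, Filter, AbortIncompleteMultipartUpload with DaysAfterInitiation) and should be checked against the current API reference. The rule ID and helper name are made up.

```python
import json


def abort_incomplete_uploads_rule(days: int, prefix: str = "") -> dict:
    """Build a lifecycle configuration that aborts stale multipart uploads."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": f"abort-incomplete-mpu-after-{days}d",  # hypothetical name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


print(json.dumps(abort_incomplete_uploads_rule(7), indent=2))
```

With boto3, a dictionary of this shape would typically be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration.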

🌍

Region-wide outage

us-east-1 goes down; customers with data only in that region are offline for hours.

→ Mitigation: cross-region replication (CRR) feature; customer opts in, metadata + bytes replicated asynchronously to another region. Not free — doubles storage cost for replicated buckets.

Interview Tips

Durability math is the core. 11 nines is a number. Work backward from it: what does that imply about replication factor, geographic spread, verification?
Name erasure coding by type. "Reed-Solomon (10, 4)" or "(12, 4)" is much more credible than "we use erasure coding." Know the storage-vs-durability tradeoff by heart.
Split the two planes. Index service (metadata) vs data fleet (bytes). Don't muddle them. The index is transactional; bytes are immutable blobs.
Hot partitions are the classic follow-up. Interviewer will ask "what if everyone's key starts with today's date?" Have the auto-split + hash-prefix answer ready.
Multipart upload is more than chunking. It's resumable, parallelizable, and lets you abort partial work. Describe each of those benefits distinctly.

Evolution

MVP — replicated blobs + flat namespace

3× replication. Single KV for metadata. No prefix optimization. Works to ~100 TB. S3 launched in 2006 roughly here.

Erasure-coded storage + prefix sharding

RS(10,4) replaces 3× replication → 2× storage savings. Metadata sharded by key prefix. Hot partitions manual-mitigated (customer must randomize prefix).

Storage tiering + lifecycle policies

Standard / Infrequent Access / Glacier. Automatic transition based on access patterns. Customers pay 10× less for cold data.

Auto-partitioning + strong consistency

Hot prefixes auto-split without customer action (2018). Read-after-write strong consistency globally (2020). Huge operational improvements for end users.

Intelligent-Tiering + Object Lambda

Service learns access patterns per object, auto-tiers. Lambda functions can transform data on GET (rewrite response bytes). The storage primitive extends into a compute fabric.
