Amazon S3

A bucket-and-key object store with read-your-writes consistency, ~11 nines of durability, and exabyte-scale capacity. The hard parts: a key → bytes service that doesn't fall over at millions of requests per second; erasure-coded durability that is cheaper than 3× replication yet still rebuilds gracefully; and a hot-partition story, i.e. what happens when everyone writes to keys starting with 2026-04-12/. S3 stores hundreds of trillions of objects and serves on the order of 100 million requests per second at peak.

⚡ Core: Durability + Hot-Partition Handling · 11 nines durability · 100T+ objects · ~100M req/sec peak · Exabyte storage
02

Requirements

Functional
  • PUT / GET / DELETE on (bucket, key) tuples; opaque byte contents
  • Object size from 0 bytes to 5 TB; multipart upload for large ones
  • Bucket ACLs + per-object ACLs; IAM integration; pre-signed URLs
  • Versioning per bucket (opt-in): preserve old object versions on overwrite
  • List objects by prefix (for file-system-like browse)
  • Lifecycle: transition to Glacier after N days; auto-delete after M
  • Events: emit notifications (SNS/SQS/Lambda) on put/delete
Non-Functional
  • 11 nines durability (99.999999999%): about one object in 100 billion lost per year
  • Strong read-after-write consistency for new objects, overwrites, deletes, and lists (since Dec 2020)
  • Scale to ~3,500 PUT / 5,500 GET per second per partitioned prefix; aggregate throughput is horizontally unbounded
  • Available across an AWS region with AZ redundancy (3 AZs minimum)
  • Integrity verified via MD5/SHA on every byte
  • Pay-for-what-you-store: cold tiers 10× cheaper than hot
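The per-prefix throughput requirement can be turned into a quick sizing rule. The sketch below is illustrative only: `prefixes_needed` is a hypothetical helper, and the 3,500 figure is AWS's published per-prefix write rate, passed in as a parameter rather than hard-coded.

```python
def prefixes_needed(target_rps: int, per_prefix_limit: int) -> int:
    """Smallest number of key prefixes whose combined limit covers target_rps."""
    if target_rps < 0 or per_prefix_limit <= 0:
        raise ValueError("target_rps must be >= 0 and per_prefix_limit > 0")
    # Integer ceiling division; at least one prefix always exists.
    return max(1, -(-target_rps // per_prefix_limit))


# Example: a workload that wants 50,000 writes/sec at 3,500 writes/sec per prefix.
needed = prefixes_needed(50_000, 3_500)
```

Under those assumptions the workload needs its keys spread over 15 prefixes before the per-prefix limit stops being the bottleneck.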
03

Scale Estimation

Objects stored
~280T
AWS disclosed: 280 trillion objects as of 2023; growing double-digit % YoY
Request peak
~100M req/sec
aggregate across all customers; individual buckets throttle at ~3,500 PUT/sec (or ~5,500 GET/sec) per prefix
Durability target
11 nines
achieved via Reed-Solomon erasure coding + cross-AZ placement
Storage overhead
~1.4×
RS(10,4) stores 14 shards for every 10 data shards — cheaper than 3× replication
Partition key space
1024
bucket split into ~1024 index partitions; auto-splits on load
Disk rebuild time
~hours
failed disk rebuild = read 10 surviving shards per stripe, then Reed-Solomon decode (Galois-field arithmetic, not plain XOR) to recover; bandwidth-bounded
04

API Design

PUT /{bucket}/{key}

Upload an object. Body = bytes. Headers: Content-MD5 (for verification), x-amz-storage-class (STANDARD / STANDARD_IA / GLACIER). Returns an ETag (the MD5 of the object for a plain single-part upload). Idempotent on identical bytes.

GET /{bucket}/{key}

Fetch object. Supports Range headers for partial reads (Range: bytes=0-1048575). Returns object bytes + metadata headers (x-amz-meta-*).
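A client that wants to download a large object in parallel has to cut it into Range header values like the one above. The following is a minimal sketch; `byte_ranges` is a hypothetical helper name, and the chunk size is whatever the caller chooses.

```python
def byte_ranges(object_size: int, chunk_size: int) -> list[str]:
    """Return Range header values that cover [0, object_size) in chunk_size pieces."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size > 0")
    ranges = []
    for start in range(0, object_size, chunk_size):
        # HTTP byte ranges are inclusive on both ends, hence the "- 1".
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges
```

Each returned string can be sent as the Range header of an independent GET, and the responses concatenated in order.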

POST /{bucket}/{key}?uploads

Initiate multipart upload. Returns UploadId. For objects > 100 MB or resumable uploads. Client then uploads parts in parallel.

PUT /{bucket}/{key}?partNumber=N&uploadId=ID

Upload part (minimum 5 MB except last). Returns ETag for that part. Parts can be uploaded in parallel; max 10,000 parts per object.
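The two limits above (5 MB minimum part, 10,000 parts maximum) jointly constrain the part size a client may pick. A minimal sketch of that calculation follows; `plan_parts` and the 8 MiB preferred size are assumptions for illustration, not part of the S3 API.

```python
MIN_PART = 5 * 1024**2   # minimum size of every part except the last
MAX_PARTS = 10_000       # maximum number of parts per object


def plan_parts(object_size: int, preferred_part: int = 8 * 1024**2) -> tuple[int, int]:
    """Return (part_size, part_count) that respects both multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that still fits the object into MAX_PARTS parts.
    floor_for_count = -(-object_size // MAX_PARTS)
    part_size = max(preferred_part, MIN_PART, floor_for_count)
    part_count = -(-object_size // part_size)  # ceiling division
    return part_size, part_count
```

For small objects the preferred size wins; for multi-terabyte objects the 10,000-part cap forces the part size up into the hundreds of megabytes.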

POST /{bucket}/{key}?uploadId=ID

Complete multipart. Body = ordered list of (partNumber, ETag). Server validates all parts exist; stitches into single object; returns final object ETag.
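The final ETag of a multipart object is not the MD5 of the whole object. For uploads without KMS encryption it is commonly observed to be the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. The sketch below reproduces that convention; treat it as an illustration, since S3 documents the ETag as an opaque value, and `multipart_etag` is a hypothetical helper.

```python
import hashlib


def multipart_etag(parts: list[bytes]) -> str:
    """Compute an ETag in the 'md5-of-part-md5s' style followed by '-<part count>'."""
    if not parts:
        raise ValueError("at least one part is required")
    # Concatenate the raw 16-byte MD5 digest of each part, in part order.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

A client can use this to verify a completed upload locally, provided it remembers the exact part boundaries it used.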

GET /{bucket}?prefix=folder/&delimiter=/&max-keys=1000

List objects. Cursor-based pagination. Heavy use case — dominates hot-partition problems.
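How prefix and delimiter turn a flat key space into a folder-like view can be sketched over an in-memory key list. This is an illustration only: `list_objects` is a hypothetical function, and pagination (max-keys and the cursor) is left out for brevity.

```python
def list_objects(keys, prefix="", delimiter=""):
    """Return (contents, common_prefixes) for a flat collection of keys."""
    contents, common_prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            # No delimiter after the prefix: the key is listed directly.
            contents.append(key)
        else:
            # Roll everything up to the first delimiter into one "folder" entry.
            rolled = prefix + rest[:cut + len(delimiter)]
            # Keys are sorted, so equal rolled-up prefixes are adjacent.
            if not common_prefixes or common_prefixes[-1] != rolled:
                common_prefixes.append(rolled)
    return contents, common_prefixes
```

The scan over sorted keys is the reason listing is cheap on a range-partitioned index and also the reason it concentrates load on whichever partition holds the prefix.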

DELETE /{bucket}/{key}

Delete object. With versioning enabled, creates a delete marker; actual bytes retained until version is explicitly deleted.
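The delete-marker behaviour can be modelled with a small in-memory class. This is a toy sketch under stated assumptions: `VersionedBucket` is hypothetical, and version ids are plain integers here, whereas real S3 version ids are opaque strings.

```python
import itertools


class VersionedBucket:
    """Toy model of a versioned bucket; a body of None represents a delete marker."""

    def __init__(self):
        self._versions = {}             # key -> list of (version_id, body or None)
        self._ids = itertools.count(1)  # hypothetical integer version ids

    def put(self, key, body):
        vid = next(self._ids)
        self._versions.setdefault(key, []).append((vid, body))
        return vid

    def delete(self, key, version_id=None):
        if version_id is None:
            # Plain DELETE: add a marker on top, keep all older bytes.
            vid = next(self._ids)
            self._versions.setdefault(key, []).append((vid, None))
            return vid
        # DELETE with a version id: permanently remove that one version.
        self._versions[key] = [v for v in self._versions.get(key, []) if v[0] != version_id]
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is not None:
            for vid, body in versions:
                if vid == version_id and body is not None:
                    return body
            raise KeyError(key)
        if not versions or versions[-1][1] is None:
            raise KeyError(key)  # latest version is missing or a delete marker
        return versions[-1][1]
```

A plain delete hides the object from unversioned reads while an explicit versioned read still returns the old bytes, which is what makes accidental deletes recoverable.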

05

Architecture

Two distinct planes. The index plane (metadata: (bucket, key) → physical locations) is a sharded, highly-available KV service. The data plane (actual bytes, erasure-coded across disks and AZs) is the storage fleet. A request front-end authenticates + routes; a background garbage collector reclaims deleted bytes after safe intervals.

[Figure: S3 request flow and storage layout. Client / SDK (PUT / GET) → front-end fleet (auth + route) → index plane (index service for (bucket, key) lookup; auth / ACL with IAM policy evaluation; ~1024 metadata partitions per bucket in a sharded KV that auto-splits on hot prefixes) → data plane (RS(10, 4), 14 shards per object, spread across AZ-a through AZ-d; storage nodes with thousands of disks, each running a local placement service; Reed-Solomon encoder / decoder embedded in the write and read path). Background control plane: GC / repair (scrub + rebuild), lifecycle (tier / expire), event service (SNS / SQS fan-out), cross-region replication.]
Request Flow
  1. Client: PUT /bucket/key
  2. Front-end: auth + route
  3. Index alloc: object_id + partition
  4. RS encoder: 10 data + 4 parity
  5. AZ placement: across 3+ AZs
  6. Durability ack: all 14 shards
  7. Commit + ETag: metadata committed
06

Deep Dive — Durability via Erasure Coding + Hot Partitions

S3 promises "eleven nines" of durability. That's not a marketing line; it's a budget. 99.999999999% means that if you store 100B objects you expect to lose ~1 per year. Achieving this cheaply is the central engineering challenge.

Triple replication is too expensive. Storing 3 copies gives ~6 nines and costs 3× storage. For exabyte scale, that's billions of dollars of overhead.

Reed-Solomon erasure coding. Split each object into 10 data shards; compute 4 parity shards. Any 10 of 14 can reconstruct the whole object, so data is lost only if 5 or more shards fail before repair completes (astronomically unlikely if shards are spread across independent disks and AZs). Storage overhead: 1.4× instead of 3×. Durability: ~11 nines when placed across multiple AZs.
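The comparison between replication and erasure coding can be made concrete with a binomial model. The sketch below assumes failures are independent and uses a made-up per-shard loss probability (`P_FAIL`); real durability models also account for repair time and correlated failures, so the numbers are only illustrative.

```python
from math import comb


def loss_probability(total_shards: int, tolerated_losses: int, p_fail: float) -> float:
    """P(more than tolerated_losses of total_shards fail), with independent failures."""
    return sum(
        comb(total_shards, i) * p_fail**i * (1 - p_fail) ** (total_shards - i)
        for i in range(tolerated_losses + 1, total_shards + 1)
    )


# Hypothetical chance that a single copy or shard is lost before repair finishes.
P_FAIL = 0.001

replication = loss_probability(3, 2, P_FAIL)   # 3 copies, survives 2 losses
rs_10_4 = loss_probability(14, 4, P_FAIL)      # 14 shards, survives 4 losses
overhead_replication = 3 / 1                   # bytes stored per byte of data
overhead_rs = 14 / 10
```

With this assumed failure probability, triple replication loses an object with probability of roughly 1e-9 and RS(10,4) with roughly 2e-12, while storing 1.4 bytes per data byte instead of 3.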

Write Path: Erasure-coded placement (Mermaid)

sequenceDiagram
    participant C as Client
    participant FE as Front-end
    participant IX as Index svc
    participant ENC as RS Encoder
    participant S as Storage fleet
    C->>FE: PUT /bucket/key (bytes)
    FE->>IX: allocate object_id
    IX-->>FE: object_id + partition hint
    FE->>ENC: split + encode bytes
    ENC->>ENC: produce 10 data + 4 parity shards
    par write across AZs
        ENC->>S: shard 1→10 (AZ a/b/c/d)
        ENC->>S: parity 1→4
    end
    S-->>FE: all 14 acked
    FE->>IX: commit {object_id, shard_locations, etag}
    IX-->>FE: committed
    FE-->>C: 200 OK + ETag
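The "any 10 of 14" property of the encode step can be demonstrated with a toy Reed-Solomon-style code. This sketch is an assumption-laden illustration, not S3's encoder: it works over the prime field GF(257) with Lagrange interpolation for brevity, whereas production codes typically use GF(2^8) with optimized libraries, and `encode` / `decode` are hypothetical names.

```python
P = 257  # prime modulus, chosen so every byte value 0..255 is a field element


def _lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, n_parity):
    """Systematic encode: shards 0..k-1 are the data, the rest are parity symbols."""
    k = len(data)
    points = list(enumerate(data))
    parity = [_lagrange_eval(points, x) for x in range(k, k + n_parity)]
    return list(data) + parity


def decode(shards, k):
    """Recover the k data symbols from any k surviving shards ({index: value})."""
    if len(shards) < k:
        raise ValueError("not enough surviving shards to reconstruct")
    points = list(shards.items())[:k]
    return [_lagrange_eval(points, x) for x in range(k)]


data = list(b"helloworld")            # one stripe of 10 data symbols
shards = encode(data, 4)              # 14 shards in total
survivors = {i: s for i, s in enumerate(shards) if i not in {0, 3, 7, 12}}
recovered = decode(survivors, 10)     # four shards lost, data still recoverable
```

The code treats the 10 data symbols as points on a degree-9 polynomial; any 10 of the 14 evaluations pin that polynomial down, which is why four losses are survivable and a fifth is not.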

Hot partition problem. The metadata index is sharded. Historically S3 partitioned the index by lexicographic key range, which keeps prefix listing cheap but places keys that share a prefix on the same partitions. A naming pattern like 2026/04/12/log-XXX.json meant all of today's writes hit the same few metadata partitions. Throughput caps at ~3,500 PUT/sec per prefix; worst-case customers saw throttling.

S3 solved this in two steps:

  1. Hot partition auto-split. Metadata service detects rising QPS on a partition and automatically splits it into two, rebalancing keys. Background process; live traffic largely unaffected.
  2. Auto-partitioning (2018 launch). S3 raised per-prefix request rates and began partitioning automatically as load on a prefix grows, so customers no longer need to add random hash prefixes to their key names. For extreme or sudden rates the old advice still helps: ${random_hex}/2026/04/12/log.json spreads load evenly from the first request.
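The hash-prefix workaround can be sketched in a few lines. This version derives the prefix from a hash of the key instead of a random value, which is a deliberate variation so the full key can be recomputed later; `spread_key` and the 4-character prefix length are assumptions for illustration.

```python
import hashlib


def spread_key(natural_key: str, prefix_hex_chars: int = 4) -> str:
    """Prepend a short, deterministic hex prefix so keys scatter across the key space."""
    if not 1 <= prefix_hex_chars <= 32:
        raise ValueError("prefix_hex_chars must be between 1 and 32")
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_hex_chars]}/{natural_key}"


example = spread_key("2026/04/12/log.json")
```

The tradeoff is that listing "everything from 2026/04/12" now means one list call per possible prefix, so the prefix should be kept as short as the write rate allows.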

Rebuild storm. When a disk fails, the system reads 10 of the 13 surviving shards of each affected stripe to rebuild the lost shard onto a new disk. That is 10× read amplification: every byte rebuilt costs ten bytes read from surviving disks. Mitigations: prioritize stripes with the least remaining redundancy, throttle rebuild bandwidth so it cannot starve live traffic, and keep spare capacity so even a major failure doesn't push the system into the red zone.
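Prioritizing the stripes closest to data loss is naturally expressed as a priority queue. The sketch below is a minimal illustration with hypothetical names (`rebuild_order`, stripe ids as strings); a real repair scheduler would also weigh bandwidth limits and placement.

```python
import heapq


def rebuild_order(stripes: dict[str, int], data_shards: int = 10, total_shards: int = 14) -> list[str]:
    """Return stripe ids needing repair, ordered by fewest spare shards remaining.

    stripes maps a stripe id to its count of surviving shards.
    """
    heap = [
        (surviving - data_shards, stripe_id)   # margin before data loss
        for stripe_id, surviving in stripes.items()
        if data_shards <= surviving < total_shards  # degraded but still recoverable
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


order = rebuild_order({"a": 13, "b": 11, "c": 14, "d": 12, "e": 10})
```

Stripe "e" has no margin left and is rebuilt first, while the fully healthy stripe "c" is skipped entirely.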

Multipart upload for big objects. A 5 TB object can't reasonably be one PUT. Multipart lets you:

  • Upload parts in parallel (higher throughput).
  • Resume on failure (only re-upload the failed part).
  • Abort to reclaim space (lifecycle policies clean stale uploads).

Each part is stored as its own erasure-coded blob; the final CompleteMultipartUpload just writes a manifest. No byte-level stitching required.
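If the completed object is just a manifest of parts, a ranged read must first map a byte offset to the part that holds it. The following is a minimal sketch of that lookup using cumulative part sizes; `locate` is a hypothetical helper, and the manifest layout is an assumption based on the description above.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(part_sizes: list[int], offset: int) -> tuple[int, int]:
    """Map an object byte offset to (1-based part number, offset inside that part)."""
    ends = list(accumulate(part_sizes))        # cumulative end offset of each part
    total = ends[-1] if ends else 0
    if not 0 <= offset < total:
        raise IndexError("offset outside object")
    idx = bisect_right(ends, offset)           # first part whose end is > offset
    start = ends[idx - 1] if idx else 0
    return idx + 1, offset - start
```

Binary search keeps the lookup logarithmic in the part count, so even a 10,000-part object resolves an offset in about 14 comparisons.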

Interview answer

"Two planes — index (metadata) and data (bytes). Data plane uses Reed-Solomon (10, 4) erasure coding spread across 3+ AZs: 14 shards, any 10 rebuild the object, 1.4× storage overhead, 11 nines durability. Index plane is sharded by bucket key; auto-splits hot partitions. Front-end fleet authenticates + routes. Multipart upload splits big objects into parallel, resumable parts. Background control plane handles GC, lifecycle tiering, event emission, and cross-region replication. The hard parts are durability math and hot-partition auto-handling — not the API surface."

07

Tradeoffs & Design Choices

  • Erasure coding vs replication. EC saves storage but costs more CPU (encode/decode on every read/write) and higher recovery bandwidth. Worth it only at large scale. Small-scale systems stick with replication.
  • Strong vs eventual consistency. S3 became strongly consistent in December 2020. Before: overwrites, deletes, and lists were eventually consistent, so a read shortly after a write could return stale data or a missing key. Now: every GET, HEAD, and LIST reflects the latest successful PUT or DELETE. Users pay nothing extra; AWS absorbed the engineering cost.
  • Storage tiering. Hot (SSD), warm (HDD), cold (Glacier = tape / slow disk). 10× cost difference between tiers. Lifecycle policies auto-migrate; reads from cold tiers take minutes-to-hours. Important to expose as an explicit product rather than hiding.
  • Per-object metadata overhead. Every object has a few KB of metadata (ACL, storage class, lifecycle markers). At 280T objects, even 1 KB each comes to ~280 PB of metadata alone, which is not ignorable.
  • List consistency. Before December 2020, list operations could miss just-written objects for seconds, and callers who needed certainty fetched the key directly with GetObject. Lists are now strongly consistent as well; the tradeoff sits inside the index plane, which must keep listings in step with every committed write.
08

Failure Modes

💾
Concurrent disk failures during rebuild
One disk fails; during rebuild a 2nd in the same stripe fails; then a 3rd. With RS(10,4) we can tolerate 4 losses; beyond that → data loss.
→ Mitigation: (a) spread shards across AZs so correlated power/network failures don't take out multiple shards; (b) prioritize rebuilds for stripes with fewer surviving shards; (c) cap disk fleet utilization so rebuild has headroom.
🔥
Hot prefix throttling
Customer logs keyed 2026/04/12/evt-000001.json all hit the same metadata partition; the ~3,500 PUT/sec ceiling is hit; customer sees 503 Slow Down responses.
→ Mitigation: auto-split triggered by sustained high QPS; split takes a few minutes. Customer guidance: prefix with a hash ff23/2026/04/12/evt-... for immediate workaround.
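The auto-split mitigation can be sketched as a range split at the median key once load crosses a threshold. S3's actual split policy is not public, so everything here is an assumption for illustration, including the function name `split_if_hot` and the choice of the median as the split point.

```python
def split_if_hot(partition_keys: list[str], observed_qps: float, limit_qps: float) -> list[list[str]]:
    """Return the partition unchanged, or two key ranges split at the median key."""
    keys = sorted(partition_keys)
    if observed_qps <= limit_qps or len(keys) < 2:
        return [keys]
    mid = len(keys) // 2
    # Lower half keeps keys[:mid]; upper half owns keys[mid:] and everything after.
    return [keys[:mid], keys[mid:]]
```

A median split relieves a partition whose load is spread over existing keys, but with monotonically increasing names such as dated log keys every new write still lands in the upper half, which is why the hash-prefix workaround helps immediately.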
🕯️
Silent data corruption (bit rot)
Disk returns wrong bytes, hash mismatches. If undetected, corruption propagates through rebuilds.
→ Mitigation: continuous background scrubber reads every shard, verifies hash, repairs from parity. Every object has end-to-end MD5/SHA that client + server verify on write and read.
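The scrubber's core loop is a hash comparison per shard. The sketch below assumes each shard has a SHA-256 digest recorded at write time; `scrub` and the dictionary layout are hypothetical, and the repair step (rebuilding from parity) is left out.

```python
import hashlib


def scrub(shards: dict[str, bytes], expected: dict[str, str]) -> list[str]:
    """Return ids of shards whose SHA-256 no longer matches the recorded digest."""
    return sorted(
        shard_id
        for shard_id, data in shards.items()
        if hashlib.sha256(data).hexdigest() != expected.get(shard_id)
    )


stored = {"s1": b"alpha", "s2": b"beta", "s3": b"gamma"}
recorded = {sid: hashlib.sha256(body).hexdigest() for sid, body in stored.items()}
stored["s2"] = b"bet\x00"   # simulate a silently flipped byte on disk
corrupted = scrub(stored, recorded)
```

Any shard id returned here would be queued for reconstruction from the stripe's remaining shards before the corruption can spread through a later rebuild.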
🚨
Accidental public-bucket disaster
Customer misconfigures bucket policy; data leaks. This has caused major breaches (Verizon, Dow Jones, countless others).
→ Mitigation: "Block Public Access" settings (available at account and bucket level since 2018, enabled by default on new buckets since April 2023); UI flags public buckets prominently; new buckets are private by default. It's a product + UX mitigation, not just engineering.
🔁
Orphaned multipart uploads
Customer initiates multipart upload, uploads 50 parts, never calls Complete or Abort. Parts sit in storage indefinitely, billing the customer.
→ Mitigation: lifecycle rule auto-abort incomplete multipart uploads after N days. Bucket storage analytics surfaces orphaned uploads to customers.
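The auto-abort rule reduces to a cutoff comparison on each upload's start time. This is a minimal sketch with hypothetical names (`stale_uploads`, upload ids as strings); the actual abort call and the lifecycle configuration format are not shown.

```python
from datetime import datetime, timedelta, timezone


def stale_uploads(uploads: dict[str, datetime], max_age_days: int, now: datetime) -> list[str]:
    """Return ids of incomplete uploads that were initiated more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(upload_id for upload_id, started in uploads.items() if started < cutoff)


now = datetime(2026, 4, 12, tzinfo=timezone.utc)
pending = {
    "upload-old": datetime(2026, 3, 1, tzinfo=timezone.utc),
    "upload-recent": datetime(2026, 4, 10, tzinfo=timezone.utc),
}
to_abort = stale_uploads(pending, max_age_days=7, now=now)
```

Passing `now` explicitly keeps the function deterministic and easy to test; a background job would supply the current UTC time.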
🌍
Region-wide outage
us-east-1 goes down; customers with data only in that region are offline for hours.
→ Mitigation: cross-region replication (CRR) feature; customer opts in, metadata + bytes replicated asynchronously to another region. Not free — doubles storage cost for replicated buckets.
09

Interview Tips

  1. Durability math is the core. 11 nines is a number. Work backward from it: what does that imply about replication factor, geographic spread, verification?
  2. Name erasure coding by type. "Reed-Solomon (10, 4)" or "(12, 4)" is much more credible than "we use erasure coding." Know the storage-vs-durability tradeoff by heart.
  3. Split the two planes. Index service (metadata) vs data fleet (bytes). Don't muddle them. The index is transactional; bytes are immutable blobs.
  4. Hot partitions are the classic follow-up. Interviewer will ask "what if everyone's key starts with today's date?" Have the auto-split + hash-prefix answer ready.
  5. Multipart upload is more than chunking. It's resumable, parallelizable, and lets you abort partial work. Describe each of those benefits distinctly.
11

Evolution

1

MVP — replicated blobs + flat namespace

3× replication. Single KV for metadata. No prefix optimization. Works to ~100 TB. S3 launched in 2006 roughly here.

2

Erasure-coded storage + prefix sharding

RS(10,4) replaces 3× replication → 2× storage savings. Metadata sharded by key prefix. Hot partitions manual-mitigated (customer must randomize prefix).

3

Storage tiering + lifecycle policies

Standard / Infrequent Access / Glacier. Automatic transition based on object age via lifecycle rules. Customers pay 10× less for cold data.

4

Auto-partitioning + strong consistency

Hot prefixes auto-split without customer action (2018). Read-after-write strong consistency globally (2020). Huge operational improvements for end users.

5

Intelligent-Tiering + Object Lambda

Service learns access patterns per object, auto-tiers. Lambda functions can transform data on GET (rewrite response bytes). The storage primitive extends into a compute fabric.
