A bucket-and-key object store with read-your-writes consistency, ~11 nines of durability, and exabyte-scale capacity. The hard parts:
a key → bytes service that doesn't fall over at millions of requests per second;
erasure-coded durability that is cheaper than 3× replication yet still rebuilds gracefully;
and a partition-key-hot-spot story — what happens when everyone writes to keys starting with 2026-04-12/.
S3 stores hundreds of trillions of objects and serves on the order of 100 million requests per second.
PUT / GET / DELETE on (bucket, key) tuples; opaque byte contents
Object size from 0 bytes to 5 TB; multipart upload for large ones
Bucket ACLs + per-object ACLs; IAM integration; pre-signed URLs
Versioning per bucket (opt-in): preserve old object versions on overwrite
List objects by prefix (for file-system-like browse)
Lifecycle: transition to Glacier after N days; auto-delete after M
Events: emit notifications (SNS/SQS/Lambda) on put/delete
Non-Functional
11 nines durability (99.999999999%): on average one object in 100 billion lost per year
Strong read-after-write consistency for all operations: new objects, overwrites, deletes, and lists (since Dec 2020)
Scale individual buckets to at least ~3,500 PUT / 5,500 GET per prefix per second; horizontally unbounded aggregate
Available across an AWS region with AZ redundancy (3 AZs minimum)
Integrity verified via MD5/SHA on every byte
Pay-for-what-you-store: cold tiers 10× cheaper than hot
03
Scale Estimation
Objects stored
~280T
AWS disclosed: 280 trillion objects as of 2023; growing double-digit % YoY
Request peak
~100M req/sec
aggregate across all customers; individual buckets throttle at ~3,500 PUT/sec per prefix
Durability target
11 nines
achieved via Reed-Solomon erasure coding + cross-AZ placement
Storage overhead
~1.4×
RS(10,4) stores 14 shards for every 10 data shards — cheaper than 3× replication
Partition key space
1024
bucket split into ~1024 index partitions; auto-splits on load
Disk rebuild time
~hours
failed disk rebuild = read 10 surviving shards per stripe, Reed-Solomon decode (Galois-field arithmetic, not a plain XOR) to recover; bandwidth-bounded
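The ~1.4× storage-overhead figure above is simple arithmetic. A minimal sketch follows; the helper names and the 1 EB fleet size are illustrative assumptions, not AWS figures.

```python
def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per logical byte for an RS(data, parity) layout."""
    return (data_shards + parity_shards) / data_shards


def raw_capacity_pb(logical_pb: float, overhead: float) -> float:
    """Raw capacity (PB) needed to hold `logical_pb` of customer data."""
    return logical_pb * overhead


rs_10_4 = ec_overhead(10, 4)       # 14 shards stored per 10 data shards
triple_replication = 3.0           # three full copies

# Hypothetical fleet holding 1 EB (1,000 PB) of logical data.
logical_pb = 1_000
saved_pb = (raw_capacity_pb(logical_pb, triple_replication)
            - raw_capacity_pb(logical_pb, rs_10_4))
print(f"RS(10,4) overhead: {rs_10_4:.1f}x; raw PB saved vs 3x: {saved_pb:.0f}")
```

Under these assumptions the gap is roughly 1,600 PB of raw disk per exabyte stored, which is why the choice of code matters at this scale.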
04
API Design
PUT /{bucket}/{key}
Upload an object. Body = bytes. Headers: Content-MD5 (for verification), Storage-Class (STANDARD / IA / GLACIER). Returns ETag (MD5 of the object for single-part uploads). Idempotent on identical bytes.
GET /{bucket}/{key}
Fetch object. Supports Range headers for partial reads (Range: bytes=0-1048575). Returns object bytes + metadata headers (x-amz-meta-*).
POST /{bucket}/{key}?uploads
Initiate multipart upload. Returns UploadId. For objects > 100 MB or resumable uploads. Client then uploads parts in parallel.
PUT /{bucket}/{key}?partNumber=N&uploadId=ID
Upload part (minimum 5 MB except last). Returns ETag for that part. Parts can be uploaded in parallel; max 10,000 parts per object.
POST /{bucket}/{key}?uploadId=ID
Complete multipart. Body = ordered list of (partNumber, ETag). Server validates all parts exist; stitches into single object; returns final object ETag.
List objects. Cursor-based pagination. Heavy use case — dominates hot-partition problems.
DELETE /{bucket}/{key}
Delete object. With versioning enabled, creates a delete marker; actual bytes retained until version is explicitly deleted.
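The delete-marker behaviour on a versioned bucket can be modelled with a toy in-memory class. This is only a sketch: `VersionedBucket` and its methods are hypothetical names, and integer version ids stand in for whatever opaque ids a real service issues.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class VersionedBucket:
    """Toy model of a versioned bucket; not the S3 implementation."""

    def __init__(self) -> None:
        # key -> list of (version_id, data); data is None for a delete marker
        self._versions: Dict[str, List[Tuple[int, Optional[bytes]]]] = {}
        self._ids = itertools.count(1)

    def put(self, key: str, data: bytes) -> int:
        version_id = next(self._ids)
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key: str) -> int:
        """Versioned delete: append a delete marker, keep older bytes."""
        version_id = next(self._ids)
        self._versions.setdefault(key, []).append((version_id, None))
        return version_id

    def get(self, key: str, version_id: Optional[int] = None) -> Optional[bytes]:
        versions = self._versions.get(key, [])
        if version_id is None:
            # Latest version wins; a delete marker reads as "not found".
            return versions[-1][1] if versions else None
        for vid, data in versions:
            if vid == version_id:
                return data
        return None

    def delete_version(self, key: str, version_id: int) -> None:
        """Explicit version delete: the only call that drops stored bytes."""
        self._versions[key] = [
            v for v in self._versions.get(key, []) if v[0] != version_id
        ]


bucket = VersionedBucket()
v1 = bucket.put("report.csv", b"first draft")
bucket.delete("report.csv")                  # adds a delete marker
hidden = bucket.get("report.csv")            # None: latest is the marker
old = bucket.get("report.csv", v1)           # old bytes still readable
```

A plain GET after the delete sees nothing, while a GET by version id still returns the old bytes until `delete_version` is called.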
05
Architecture
Two distinct planes. The index plane (metadata: (bucket, key) → physical locations) is a sharded, highly-available KV service. The data plane (actual bytes, erasure-coded across disks and AZs) is the storage fleet. A request front-end authenticates + routes; a background garbage collector reclaims deleted bytes after safe intervals.
Deep Dive — Durability via Erasure Coding + Hot Partitions
S3 promises "eleven nines" of durability. That's not a marketing line; it's a budget. 99.999999999% means that if you store 100 billion objects you expect to lose about one per year. Achieving this cheaply is the central engineering challenge.
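The durability budget can be worked through directly. This sketch treats "N nines" as an annual per-object loss probability of 10^-N and uses exact fractions to avoid floating-point noise; the function names are illustrative.

```python
from fractions import Fraction


def annual_loss_probability(nines: int) -> Fraction:
    """Per-object annual loss probability for a durability of `nines` nines."""
    return Fraction(1, 10 ** nines)


def expected_losses_per_year(objects: int, nines: int) -> Fraction:
    """Expected number of objects lost per year across the whole fleet."""
    return objects * annual_loss_probability(nines)


# 100 billion objects at 11 nines: about one expected loss per year.
per_year_100b = expected_losses_per_year(10 ** 11, 11)

# 10 million objects at 11 nines: one expected loss per 10,000 years.
per_year_10m = expected_losses_per_year(10 ** 7, 11)
years_per_loss_10m = 1 / per_year_10m

print(per_year_100b, years_per_loss_10m)
```

Working backward from the target this way makes it clear the budget is per object per year, so the expected loss count scales linearly with the number of objects stored.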
Triple replication is too expensive. Storing 3 copies gives ~6 nines and costs 3× storage. For exabyte scale, that's billions of dollars of overhead.
Reed-Solomon erasure coding. Split each object into 10 data shards; compute 4 parity shards. Any 10 of the 14 shards can reconstruct the whole object, so data is lost only if 5 or more shards of the same object are lost before repair (astronomically unlikely when shards sit on independent disks spread across AZs). Storage overhead: 1.4× instead of 3×. Durability: ~11 nines when placed across multiple AZs.
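The "any 10 of 14" property can be demonstrated with a toy Reed-Solomon-style code. This is a sketch under stated assumptions: it works over the prime field GF(257) with Lagrange interpolation because that is short to read, whereas production systems typically use GF(2^8) with optimized libraries. It treats each shard as a single symbol, and `encode`/`decode` are illustrative names. It needs Python 3.8+ for the modular inverse via `pow`.

```python
from typing import Dict, List, Sequence, Tuple

P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points: Sequence[Tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: Sequence[int], parity: int) -> List[int]:
    """Systematic encode: the k data symbols, then `parity` extra evaluations.

    Parity symbols may take the value 256, so they are ints, not bytes.
    """
    k = len(data)
    points = list(enumerate(data))
    return list(data) + [_interpolate(points, x) for x in range(k, k + parity)]


def decode(survivors: Dict[int, int], k: int) -> List[int]:
    """Rebuild the k data symbols from any k surviving (index, symbol) pairs."""
    if len(survivors) < k:
        raise ValueError("not enough shards to reconstruct")
    points = list(survivors.items())[:k]
    return [_interpolate(points, x) for x in range(k)]


data = list(b"objectdata")                  # 10 one-byte "shards"
coded = encode(data, 4)                     # 14 symbols in total
survivors = {i: s for i, s in enumerate(coded) if i not in (0, 3, 7, 12)}
recovered = decode(survivors, 10)           # any 10 of the 14 suffice
```

Dropping a fifth symbol leaves only 9 survivors and `decode` raises, which mirrors the point at which RS(10,4) loses data.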
Write Path: Erasure-coded placement
sequenceDiagram
participant C as Client
participant FE as Front-end
participant IX as Index svc
participant ENC as RS Encoder
participant S as Storage fleet
C->>FE: PUT /bucket/key (bytes)
FE->>IX: allocate object_id
IX-->>FE: object_id + partition hint
FE->>ENC: split + encode bytes
ENC->>ENC: produce 10 data + 4 parity shards
par write across AZs
ENC->>S: shard 1→10 (AZ a/b/c/d)
ENC->>S: parity 1→4
end
S-->>FE: all 14 acked
FE->>IX: commit {object_id, shard_locations, etag}
IX-->>FE: committed
FE-->>C: 200 OK + ETag
Hot partition problem. The metadata index is sharded by key range, which keeps prefix listing cheap but puts lexicographically adjacent keys on the same partition. A naming pattern like 2026/04/12/log-XXX.json therefore sends all of today's writes to the same few metadata partitions. Throughput caps at ~3,500 PUT/sec per prefix; in the worst case customers saw throttling.
S3 solved this in two steps:
Hot partition auto-split. Metadata service detects rising QPS on a partition and automatically splits it into two, rebalancing keys. Background process; live traffic largely unaffected.
Auto-partitioning (2018 launch). The service learned to auto-shard by key prefix entropy. Today's customers don't need to add random hash prefixes to their key names — S3 handles it. But for extreme rates, the old advice still helps: ${random_hex}/2026/04/12/log.json spreads load evenly.
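The auto-split idea can be sketched with a toy key-range index that splits a partition at the median of recently written keys once a window fills up. The class, the count-based threshold, and the median split point are all assumptions for illustration; the source does not describe S3's actual split policy.

```python
import bisect
from typing import List


class RangePartitionedIndex:
    """Toy key-range index that splits a partition once it gets too hot."""

    def __init__(self, split_threshold: int) -> None:
        self.split_threshold = split_threshold
        # Partition i covers keys in [lower_bounds[i], lower_bounds[i + 1]).
        self.lower_bounds: List[str] = [""]
        # Keys written to each partition in the current measurement window.
        self.window: List[List[str]] = [[]]

    def partition_for(self, key: str) -> int:
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def record_write(self, key: str) -> None:
        i = self.partition_for(key)
        self.window[i].append(key)
        if len(self.window[i]) >= self.split_threshold:
            self._split(i)

    def _split(self, i: int) -> None:
        keys = sorted(self.window[i])
        median = keys[len(keys) // 2]
        self.window[i] = []                     # start a fresh window
        if median > self.lower_bounds[i]:       # one hot key cannot be split
            self.lower_bounds.insert(i + 1, median)
            self.window.insert(i + 1, [])


index = RangePartitionedIndex(split_threshold=4)
for n in range(4):
    index.record_write(f"2026/04/12/log-{n:03d}.json")
```

After the fourth write the single partition splits in two, so later writes under the same date prefix are spread across both halves.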
Rebuild storm. When a disk fails, the system reads 10 of the 13 surviving shards of each affected object to rebuild the lost shard onto a new disk. That is 10× read amplification: every byte rebuilt costs 10 bytes read from the surviving disks. Mitigations: prioritize low-redundancy objects, throttle rebuild bandwidth to prevent starving live traffic, and keep spare capacity so even a major failure doesn't push the system into the red zone.
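A back-of-the-envelope estimate shows why rebuilds are bandwidth-bound. The 16 TB disk and 500 MB/s throttled rebuild rate below are hypothetical inputs, not measured values, and the model assumes the rebuild rate is the rate at which reconstructed bytes are produced.

```python
from typing import Tuple


def rebuild_estimate(disk_tb: float, data_shards: int,
                     rebuild_mb_per_s: float) -> Tuple[float, float]:
    """Return (TB read from surviving disks, hours to rebuild one failed disk)."""
    read_tb = disk_tb * data_shards               # k shards read per shard rebuilt
    seconds = disk_tb * 1e6 / rebuild_mb_per_s    # TB -> MB, decimal units
    return read_tb, seconds / 3600


# Hypothetical: 16 TB disk, RS(10,4), rebuild throttled to 500 MB/s.
read_tb, hours = rebuild_estimate(16, 10, 500)
print(f"read {read_tb:.0f} TB from survivors, about {hours:.1f} hours")
```

Raising the throttle shortens the window in which a second failure is dangerous but takes bandwidth from live traffic, which is the tradeoff the mitigations above are balancing.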
Multipart upload for big objects. A 5 TB object can't reasonably be one PUT. Multipart lets you:
Upload parts in parallel (higher throughput).
Resume on failure (only re-upload the failed part).
Abort to reclaim space (lifecycle policies clean stale uploads).
Each part is stored as its own erasure-coded blob; the final CompleteMultipartUpload just writes a manifest. No byte-level stitching required.
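Choosing a part size that respects the limits quoted earlier (5 MB minimum except for the last part, at most 10,000 parts) is a small calculation. In this sketch `plan_parts` is a hypothetical helper, the 64 MiB preferred size is an arbitrary default, and the limits are interpreted as binary units.

```python
from typing import Tuple

MIN_PART = 5 * 1024 ** 2     # 5 MiB minimum for every part except the last
MAX_PARTS = 10_000           # maximum number of parts per object


def plan_parts(object_size: int,
               preferred_part: int = 64 * 1024 ** 2) -> Tuple[int, int]:
    """Return (part_size, part_count) that fits within the multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Smallest part size that keeps the object within MAX_PARTS (integer ceil).
    floor_for_count = -(-object_size // MAX_PARTS)
    part_size = max(MIN_PART, preferred_part, floor_for_count)
    part_count = -(-object_size // part_size)
    return part_size, part_count


small = plan_parts(100 * 1024 ** 2)      # 100 MiB object
large = plan_parts(5 * 1024 ** 4)        # 5 TiB object
print(small, large)
```

For the largest object the part count is what forces the part size up: each part has to be roughly half a gibibyte to stay within 10,000 parts.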
Interview answer
"Two planes — index (metadata) and data (bytes). Data plane uses Reed-Solomon (10, 4) erasure coding spread across 3+ AZs: 14 shards, any 10 rebuild the object, 1.4× storage overhead, 11 nines durability. Index plane is sharded by bucket key; auto-splits hot partitions. Front-end fleet authenticates + routes. Multipart upload splits big objects into parallel, resumable parts. Background control plane handles GC, lifecycle tiering, event emission, and cross-region replication. The hard parts are durability math and hot-partition auto-handling — not the API surface."
07
Tradeoffs & Design Choices
Erasure coding vs replication. EC saves storage but costs more CPU (encode on every write, decode on degraded reads and rebuilds) and more recovery bandwidth. Worth it only at large scale. Small-scale systems stick with replication.
Strong vs eventual consistency. S3 became strongly consistent in Dec 2020. Before: reads of new objects were read-after-write, but overwrites, deletes, and lists were eventually consistent, so a read or list right after a write could return stale or missing data. Now: reads and lists are strongly consistent for all operations. Users pay nothing extra; AWS absorbed the engineering cost.
Storage tiering. Hot (SSD), warm (HDD), cold (Glacier = tape / slow disk). 10× cost difference between tiers. Lifecycle policies auto-migrate; reads from cold tiers take minutes-to-hours. Important to expose as an explicit product rather than hiding.
Per-object metadata overhead. Every object has a few KB of metadata (ACL, storage class, lifecycle markers). At 280T objects, even 1 KB each is ~280 PB of metadata alone, which is not ignorable.
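The metadata footprint is worth computing explicitly. The per-object sizes below (1 KB and 4 KB) are assumed bounds for "a few KB", in decimal units.

```python
PB = 10 ** 15


def metadata_bytes(objects: int, bytes_per_object: int) -> int:
    """Total index size if every object carries `bytes_per_object` of metadata."""
    return objects * bytes_per_object


objects = 280 * 10 ** 12                      # 280 trillion objects
low = metadata_bytes(objects, 1_000) / PB     # assumed 1 KB per object
high = metadata_bytes(objects, 4_000) / PB    # assumed 4 KB per object
print(f"metadata: {low:.0f} to {high:.0f} PB")
```

Under these assumptions the index alone lands in the hundreds of petabytes, so it needs the same sharding and durability care as the data plane.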
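List consistency. Before Dec 2020, list operations scanned metadata partitions and could miss just-written objects for seconds; users who needed "list includes everything I wrote" either waited a few seconds or called GetObject directly. S3 lists are now strongly consistent. A design that keeps lists eventual to simplify the index has to document that workaround.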
08
Failure Modes
💾
Concurrent disk failures during rebuild
One disk fails; during rebuild a 2nd in the same stripe fails; then a 3rd. With RS(10,4) we can tolerate 4 losses; beyond that → data loss.
→ Mitigation: (a) spread shards across AZs so correlated power/network failures don't take out multiple shards; (b) prioritize rebuilds for stripes with fewer surviving shards; (c) cap disk fleet utilization so rebuild has headroom.
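A simple binomial model shows how much headroom four parity shards buy. It assumes each shard fails independently with probability `p` during one repair window, which is exactly the assumption that correlated power or network failures break; the value p = 0.01 is purely illustrative.

```python
from math import comb


def stripe_loss_probability(n: int, k: int, p: float) -> float:
    """P(more than n - k of n shards fail), independent failures with prob p."""
    return sum(
        comb(n, f) * p ** f * (1 - p) ** (n - f)
        for f in range(n - k + 1, n + 1)
    )


p = 0.01                                       # illustrative per-shard failure prob
rs_loss = stripe_loss_probability(14, 10, p)   # lose 5+ of 14 shards
rep_loss = stripe_loss_probability(3, 1, p)    # lose all 3 replicas
print(f"RS(10,4): {rs_loss:.2e}, 3x replication: {rep_loss:.2e}")
```

Even in this crude model RS(10,4) comes out ahead of three replicas while storing less than half the bytes, but only while failures stay independent, which is why cross-AZ placement is listed first among the mitigations.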
🔥
Hot prefix throttling
Customer logs keyed 2026/04/12/evt-000001.json all hit the same metadata partition; the ~3,500 PUT/sec ceiling is hit; customer sees 503s.
→ Mitigation: auto-split triggered by sustained high QPS; split takes a few minutes. Customer guidance: prefix with a hash ff23/2026/04/12/evt-... for immediate workaround.
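The hash-prefix workaround can be sketched as a small helper. `with_hash_prefix` is a hypothetical name, and it derives the prefix from a SHA-256 of the key rather than a random value so that a reader can recompute the full key from the logical one.

```python
import hashlib


def with_hash_prefix(key: str, hex_chars: int = 4) -> str:
    """Prepend a short, deterministic hex prefix so keys spread across partitions."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:hex_chars]}/{key}"


logical_key = "2026/04/12/evt-000001.json"
stored_key = with_hash_prefix(logical_key)
print(stored_key)
```

The cost of this layout is that listing one day's events now means fanning out over every hash prefix, so it suits write-heavy workloads better than browse-heavy ones.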
🕯️
Silent data corruption (bit rot)
Disk returns wrong bytes, hash mismatches. If undetected, corruption propagates through rebuilds.
→ Mitigation: continuous background scrubber reads every shard, verifies hash, repairs from parity. Every object has end-to-end MD5/SHA that client + server verify on write and read.
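The scrubber's core check is a digest comparison per shard. In this sketch, shard ids, the in-memory dictionaries, and the choice of SHA-256 are assumptions; the repair step (rebuilding a flagged shard from the surviving ones) is left out.

```python
import hashlib
from typing import Dict, List


def scrub(shards: Dict[str, bytes], expected_sha256: Dict[str, str]) -> List[str]:
    """Return ids of shards whose bytes no longer match their recorded digest."""
    return [
        shard_id
        for shard_id, blob in shards.items()
        if hashlib.sha256(blob).hexdigest() != expected_sha256.get(shard_id)
    ]


shards = {"shard-01": b"alpha", "shard-02": b"bravo"}
recorded = {sid: hashlib.sha256(blob).hexdigest() for sid, blob in shards.items()}

clean_pass = scrub(shards, recorded)       # nothing flagged yet
shards["shard-02"] = b"brav0"              # simulate a flipped bit on disk
corrupted = scrub(shards, recorded)        # flags the damaged shard
```

Flagged shards are then treated like lost shards and rebuilt, which stops a corrupted shard from being used as input to a later rebuild.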
🚨
Accidental public-bucket disaster
Customer misconfigures bucket policy; data leaks. This has caused major breaches (Verizon, Dow Jones, countless others).
→ Mitigation: "Block Public Access" settings (launched 2018, on by default for new buckets since 2023); UI flags public buckets prominently; new buckets private-by-default. It's a product + UX mitigation, not just engineering.
🔁
Orphaned multipart uploads
Customer initiates multipart upload, uploads 50 parts, never calls Complete or Abort. Parts sit in storage indefinitely, billing the customer.
→ Mitigation: lifecycle rule auto-abort incomplete multipart uploads after N days. Bucket storage analytics surfaces orphaned uploads to customers.
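The selection logic behind an auto-abort rule is a cutoff comparison on the initiation time. This is a sketch of that logic only: `uploads_to_abort` and the upload ids are hypothetical, and a real rule would be expressed as bucket lifecycle configuration rather than client code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def uploads_to_abort(initiated_at: Dict[str, datetime], now: datetime,
                     max_age_days: int) -> List[str]:
    """Return ids of incomplete uploads initiated more than `max_age_days` ago."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(uid for uid, started in initiated_at.items() if started < cutoff)


now = datetime(2026, 4, 12, tzinfo=timezone.utc)
pending = {
    "upload-a": datetime(2026, 3, 1, tzinfo=timezone.utc),    # 42 days old
    "upload-b": datetime(2026, 4, 10, tzinfo=timezone.utc),   # 2 days old
}
stale = uploads_to_abort(pending, now, max_age_days=7)
```

Only the upload older than the cutoff is selected, so in-flight uploads that are still making progress are left alone.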
🌍
Region-wide outage
us-east-1 goes down; customers with data only in that region are offline for hours.
→ Mitigation: cross-region replication (CRR) feature; customer opts in, metadata + bytes replicated asynchronously to another region. Not free — doubles storage cost for replicated buckets.
09
Interview Tips
Durability math is the core. 11 nines is a number. Work backward from it: what does that imply about replication factor, geographic spread, verification?
Name erasure coding by type. "Reed-Solomon (10, 4)" or "(12, 4)" is much more credible than "we use erasure coding." Know the storage-vs-durability tradeoff by heart.
Split the two planes. Index service (metadata) vs data fleet (bytes). Don't muddle them. The index is transactional; bytes are immutable blobs.
Hot partitions are the classic follow-up. Interviewer will ask "what if everyone's key starts with today's date?" Have the auto-split + hash-prefix answer ready.
Multipart upload is more than chunking. It's resumable, parallelizable, and lets you abort partial work. Describe each of those benefits distinctly.
3× replication. Single KV for metadata. No prefix optimization. Works to ~100 TB. S3 launched in 2006 roughly here.
2
Erasure-coded storage + prefix sharding
RS(10,4) replaces 3× replication → 2× storage savings. Metadata sharded by key prefix. Hot partitions manual-mitigated (customer must randomize prefix).
3
Storage tiering + lifecycle policies
Standard / Infrequent Access / Glacier. Automatic transition based on object age via lifecycle rules. Customers pay 10× less for cold data.
4
Auto-partitioning + strong consistency
Hot prefixes auto-split without customer action (2018). Strong read-after-write consistency for all operations (2020). Huge operational improvements for end users.
5
Intelligent-Tiering + Object Lambda
Service learns access patterns per object, auto-tiers. Lambda functions can transform data on GET (rewrite response bytes). The storage primitive extends into a compute fabric.