Storing 1 PB of files with 3× replication = 3 PB of disk. Storage at scale is dominated by disk cost, and 3× is expensive — Facebook's photos, Netflix's video, Google's mail all add up to exabytes. Erasure coding achieves the same durability with ~1.5× overhead instead of 3×. Half the disk for the same protection. The catch: writes and rebuilds become CPU-heavy and slower. Used by S3, HDFS, Ceph, Backblaze.
02
The (k, m) scheme
Take your data block, split it into k data chunks. Compute m additional parity chunks via Reed-Solomon math. Store all k+m chunks on different disks/servers.
Magic property: any k of the k+m chunks are sufficient to reconstruct the original data. You can lose up to m chunks and still recover.
(17, 3) — S3 inferred. Overhead: 20/17 ≈ 1.18×. Eleven nines durability with one-third the storage of 3× replication.
03
Reed-Solomon in one paragraph
The data chunks are coefficients of a polynomial. The parity chunks are evaluations of that polynomial at additional points. Given any k of the k+m points, you can interpolate the polynomial uniquely (since it's degree k-1) and recover all k coefficients = original data.
Practical implementation uses Galois fields (GF(2⁸) or GF(2¹⁶)) for arithmetic over bytes. Modern CPUs (AVX2, ARM NEON) accelerate this with SIMD instructions, achieving multi-GB/s encoding speeds.
04
Erasure coding vs replication
Erasure coding
Cheap storage, expensive operations
1.2–1.5× storage overhead. Write requires encoding (CPU + small overhead). Read is fast if all data chunks available — falls back to expensive reconstruction if any are missing. Rebuilding a failed disk requires reading from k other disks. Best for cold/warm storage where reads are infrequent and storage cost dominates.
Replication
Expensive storage, cheap operations
3× storage overhead. Writes go to all replicas (parallel, no CPU). Reads from any one replica. Rebuilding = streaming from one healthy replica. Best for hot data where read latency matters most. Most DBs default to this.
The tiered approach
Modern systems use both: replication for hot data (last 7 days), erasure coding for cold (older). When you ask AWS about S3 Glacier vs S3 Standard, you're paying for this tradeoff — Glacier uses erasure coding at the price of slower retrieval.
05
Deep dive — the rebuild storm problem
One disk fails in a (10, 4) scheme. To rebuild it, you read from 10 other disks, do polynomial reconstruction, write to a new disk. This consumes IOPS on 10 disks for hours.
Now imagine an entire rack failure (16 disks down at once). 16 simultaneous rebuilds, each pulling from 10 disks → many disks see 5–10× their normal read load. Rebuild storm. Latency for normal traffic spikes. If a second failure happens during this window, you might cross the threshold (more than m=4 failures across a stripe) → permanent data loss.
Mitigations:
Spread chunks across racks/AZs to minimize correlated failure.
Throttle rebuild bandwidth to keep normal traffic responsive.
Use Local Reconstruction Codes (LRC, used by Microsoft Azure) — adds a few "local parity" chunks so most rebuilds only need a few neighbors instead of k.
The interview answer: erasure coding saves money on storage but introduces operational complexity around failures. Choose 3× replication unless you're at petabyte scale where the cost gap matters.
Rebuild Storm — (10,4) Stripe, 1 Disk LostSVG
(10,4)
Facebook HDFS scheme
200 MB/s
typical disk read
~33 min
unthrottled rebuild of 4 TB
3-4 hr
throttled real-world rebuild
06
Real-world
AWS S3 / Azure Blob
Erasure coded by default
~1.18× overhead at exabyte scale = billions in CapEx savings. Eleven nines durability per object.
Facebook HDFS
Cold storage tier
(10, 4) RS for warm-cold data. Hot data still 3× replicated. Saves petabytes.
Ceph / MinIO
Configurable
Pools can be replicated or erasure-coded. Operators tune per-workload.
Backblaze B2
(17, 3) Reed-Solomon
Famous for publishing exact code + recovery procedures. The textbook example of erasure coding at scale.
07
Used in problems
Google Drive uses erasure coding for older / less-accessed files. YouTube/Netflix store original video masters and rare-region copies with EC. Distributed logging tier-2 storage uses EC for archived logs.