Concept · Data Structures

Erasure Coding

01

Why this matters

Storing 1 PB of files with 3× replication = 3 PB of disk. Storage at scale is dominated by disk cost, and 3× is expensive — Facebook's photos, Netflix's video, Google's mail all add up to exabytes. Erasure coding achieves the same durability with ~1.5× overhead instead of 3×. Half the disk for the same protection. The catch: writes and rebuilds become CPU-heavy and slower. Used by S3, HDFS, Ceph, Backblaze.

02

The (k, m) scheme

Take your data block, split it into k data chunks. Compute m additional parity chunks via Reed-Solomon math. Store all k+m chunks on different disks/servers.

Magic property: any k of the k+m chunks are sufficient to reconstruct the original data. You can lose up to m chunks and still recover.

Common configurations:

  • (10, 4) — Facebook's HDFS. Tolerate 4 disk failures. Storage overhead: 14/10 = 1.4×.
  • (6, 3) — Backblaze. Tolerate 3 failures. Overhead: 9/6 = 1.5×.
  • (17, 3) — S3 inferred. Overhead: 20/17 ≈ 1.18×. Eleven nines durability with one-third the storage of 3× replication.
03

Reed-Solomon in one paragraph

The data chunks are coefficients of a polynomial. The parity chunks are evaluations of that polynomial at additional points. Given any k of the k+m points, you can interpolate the polynomial uniquely (since it's degree k-1) and recover all k coefficients = original data.

Practical implementation uses Galois fields (GF(2⁸) or GF(2¹⁶)) for arithmetic over bytes. Modern CPUs (AVX2, ARM NEON) accelerate this with SIMD instructions, achieving multi-GB/s encoding speeds.

04

Erasure coding vs replication

Erasure coding

Cheap storage, expensive operations

1.2–1.5× storage overhead. Write requires encoding (CPU + small overhead). Read is fast if all data chunks available — falls back to expensive reconstruction if any are missing. Rebuilding a failed disk requires reading from k other disks. Best for cold/warm storage where reads are infrequent and storage cost dominates.

Replication

Expensive storage, cheap operations

3× storage overhead. Writes go to all replicas (parallel, no CPU). Reads from any one replica. Rebuilding = streaming from one healthy replica. Best for hot data where read latency matters most. Most DBs default to this.

The tiered approach

Modern systems use both: replication for hot data (last 7 days), erasure coding for cold (older). When you ask AWS about S3 Glacier vs S3 Standard, you're paying for this tradeoff — Glacier uses erasure coding at the price of slower retrieval.

05

Deep dive — the rebuild storm problem

One disk fails in a (10, 4) scheme. To rebuild it, you read from 10 other disks, do polynomial reconstruction, write to a new disk. This consumes IOPS on 10 disks for hours.

Now imagine an entire rack failure (16 disks down at once). 16 simultaneous rebuilds, each pulling from 10 disks → many disks see 5–10× their normal read load. Rebuild storm. Latency for normal traffic spikes. If a second failure happens during this window, you might cross the threshold (more than m=4 failures across a stripe) → permanent data loss.

Mitigations:

  • Spread chunks across racks/AZs to minimize correlated failure.
  • Throttle rebuild bandwidth to keep normal traffic responsive.
  • Use Local Reconstruction Codes (LRC, used by Microsoft Azure) — adds a few "local parity" chunks so most rebuilds only need a few neighbors instead of k.

The interview answer: erasure coding saves money on storage but introduces operational complexity around failures. Choose 3× replication unless you're at petabyte scale where the cost gap matters.

Rebuild Storm — (10,4) Stripe, 1 Disk Lost SVG
Time to rebuild a 4 TB disk in (10,4) RS coding Unthrottled read 4 TB ÷ 10 disks @ 200 MB/s = ~33 min → slows traffic Throttled (50%) ~66 min · client traffic preserved LRC (local parity) ~10 min · 3 local reads t=0 fail ~66 min later During this window: another failure → permanent data loss P(2nd disk in same stripe fails per hour) × rebuild time = data-loss probability
(10,4)
Facebook HDFS scheme
200 MB/s
typical disk read
~33 min
unthrottled rebuild of 4 TB
3-4 hr
throttled real-world rebuild
06

Real-world

AWS S3 / Azure Blob

Erasure coded by default

~1.18× overhead at exabyte scale = billions in CapEx savings. Eleven nines durability per object.

Facebook HDFS

Cold storage tier

(10, 4) RS for warm-cold data. Hot data still 3× replicated. Saves petabytes.

Ceph / MinIO

Configurable

Pools can be replicated or erasure-coded. Operators tune per-workload.

Backblaze B2

(17, 3) Reed-Solomon

Famous for publishing exact code + recovery procedures. The textbook example of erasure coding at scale.

07

Used in problems

Google Drive uses erasure coding for older / less-accessed files. YouTube/Netflix store original video masters and rare-region copies with EC. Distributed logging tier-2 storage uses EC for archived logs.

Next up