Real failures.
Real root causes.
Real-world outages from AWS, Cloudflare, GitHub, GitLab, Meta, Slack, Roblox, Reddit, Datadog and others — each a canonical teaching case for a specific failure mode. Read before system design interviews to calibrate what actually goes wrong at scale.
Deploy / operator error
6 shippedS3 us-east-1
A single command typo during debugging a billing system took down significant fractions of the internet — Slack, Trello, Quora, Docker Hub, and AWS's own status page — for roughly 4 hours. The blast radius caught Amazon
Knight Capital
A stale deployment on 1 of 8 trading servers caused Knight's algorithm to submit millions of unintended orders in 45 minutes. Loss: $440 million. The firm was insolvent by end of day; absorbed by a competitor within week
Accidentally Deleted
An engineer, tired and debugging a replication issue at 11pm, ran rm -rf on the production primary database directory instead of the replica. Five of six backup methods were later discovered to have been silently failing
Kernel Driver
A content update to CrowdStrike's Falcon kernel driver caused an out-of-bounds read that crashed Windows at boot. 8.5 million Windows hosts globally went into boot loops. Airlines grounded, hospitals diverted ambulances,
"Pi Day"
A planned Kubernetes control-plane upgrade introduced an incompatibility with a custom Calico (CNI) fork Reddit had been maintaining for years. Routing broke across pods. Reddit.com went fully dark for 5 hours — the long
systemd Update
A routine Ubuntu systemd security update, applied automatically to tens of thousands of Datadog VMs across all 5 regions, broke networking on every affected host. Datadog's observability service — the one customers trust
Coordination + consensus
3 shipped43-Second Partition
A 43-second network partition between GitHub's US East and US West data centers caused MySQL Orchestrator to promote a new primary in the wrong region. 43 seconds of split-brain writes took 24 hours to reconcile. Webhook
73 Hours Down
Enabling a new streaming feature in Consul KV triggered a pathological interaction with BoltDB, Consul's embedded storage. Write latency exploded; Consul became unresponsive. Recovery took 73 hours — the worst outage in
Consul + HAProxy
A brief network blip between Slack's service registry (Consul) and its load balancers (HAProxy) caused a metastable failure: even after the network recovered, the system stayed broken because recovering generated more wo
Capacity + cascading failure
2 shippedFirst-Day-Of-Year
On the first working day of 2021, everyone returning from holiday logged into Slack at once. The spike exposed a surprising limit on AWS Transit Gateway capacity. Retries from clients amplified the load into sustained sa
Catastrophic Regex
A single regular expression in a Cloudflare WAF rule consumed all CPU on every Cloudflare machine worldwide within seconds. 502s for everyone Cloudflare fronted — ~10% of the internet — for 27 minutes.