Post-mortems

Real failures.
Real root causes.

Real-world outages from AWS, Cloudflare, GitHub, GitLab, Meta, Slack, Roblox, Reddit, Datadog and others — each a canonical teaching case for a specific failure mode. Read before system design interviews to calibrate what actually goes wrong at scale.

12 post-mortems · 4 failure-mode clusters

Deploy / operator error

6 shipped

Post-mortem · 3 min read

S3 us-east-1

A single mistyped command, entered while debugging the S3 billing system, took down significant fractions of the internet (Slack, Trello, Quora, Docker Hub, and AWS's own status page) for roughly 4 hours. The blast radius caught Amazon

Operator error · us-east-1 · ~4 hour outage

Post-mortem · 3 min read

Knight Capital

A stale deployment on 1 of 8 trading servers caused Knight's algorithm to submit millions of unintended orders in 45 minutes. Loss: $440 million. The firm was insolvent by end of day; absorbed by a competitor within week

Stale deploy · Trading · $440M loss

Post-mortem · 3 min read

Accidentally Deleted

A tired GitLab engineer, debugging a replication issue at 11pm, ran rm -rf on the production primary database directory instead of the replica. Five of six backup methods were later discovered to have been silently failing

Operator error · Backups failing silently · Data loss

Post-mortem · 3 min read

Kernel Driver

A content update to CrowdStrike's Falcon kernel driver caused an out-of-bounds read that crashed Windows at boot. 8.5 million Windows hosts globally went into boot loops. Airlines grounded, hospitals diverted ambulances,

Kernel driver · 8.5M hosts · Global

Post-mortem · 2 min read

"Pi Day"

A planned Kubernetes control-plane upgrade introduced an incompatibility with a custom Calico (CNI) fork Reddit had been maintaining for years. Routing broke across pods. Reddit.com went fully dark for 5 hours — the long

Kubernetes · CNI fork · 5h outage

Post-mortem · 2 min read

systemd Update

A routine Ubuntu systemd security update, applied automatically to tens of thousands of Datadog VMs across all 5 regions, broke networking on every affected host. Datadog's observability service — the one customers trust

Auto-update · systemd · All regions

Coordination + consensus

3 shipped

Post-mortem · 3 min read

43-Second Partition

A 43-second loss of connectivity between GitHub's US East Coast network hub and its primary US East Coast data center caused Orchestrator to promote new MySQL primaries in the US West Coast data center. 43 seconds of split-brain writes took 24 hours to reconcile. Webhook

Split brain · MySQL Orchestrator · 43s → 24h

Post-mortem · 2 min read

73 Hours Down

At Roblox, enabling a new streaming feature in Consul KV triggered a pathological interaction with BoltDB, Consul's embedded storage. Write latency exploded; Consul became unresponsive. Recovery took 73 hours, the worst outage in

Consul · BoltDB pathology · 73 hour outage

Post-mortem · 2 min read

Consul + HAProxy

A brief network blip between Slack's service registry (Consul) and its load balancers (HAProxy) caused a metastable failure: even after the network recovered, the system stayed broken because recovering generated more wo

Metastable failure · Consul · HAProxy

Capacity + cascading failure

2 shipped

Post-mortem · 2 min read

First-Day-Of-Year

On the first working day of 2021, everyone returning from holiday logged into Slack at once. The spike exposed a surprising limit on AWS Transit Gateway capacity. Retries from clients amplified the load into sustained sa

Surge traffic · AWS TGW · Retry amplification
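
Retry amplification in the First-Day-Of-Year case can be made concrete with a little arithmetic. The sketch below is a simplified model, not Slack's client logic: it assumes every failed attempt is retried immediately, attempts fail independently with the same probability, and the names `expected_attempts` and `offered_load` are hypothetical.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected sends per logical request under immediate, independent retries."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    # Attempt k (0-based) is sent only if the previous k attempts all failed,
    # which happens with probability failure_rate ** k.
    return sum(failure_rate ** k for k in range(max_retries + 1))


def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Requests per second the backend sees once retries are included."""
    return base_rps * expected_attempts(failure_rate, max_retries)


if __name__ == "__main__":
    # Assumed numbers for illustration: 1,000 rps of real demand, 3 retries.
    for rate in (0.0, 0.5, 1.0):
        print(f"failure rate {rate:.1f}: {offered_load(1000, rate, 3):.0f} rps")
```

Under this model the multiplier grows with the failure rate, so a saturated dependency receives the most traffic exactly when it can least absorb it. Backoff, jitter, and retry budgets are the usual ways to flatten that curve.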

Post-mortem · 2 min read

Catastrophic Regex

A single regular expression in a Cloudflare WAF rule consumed all CPU on every Cloudflare machine worldwide within seconds. 502s for everyone Cloudflare fronted — ~10% of the internet — for 27 minutes.

Regex backtracking · Global · 27 minutes
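
To see why a single pattern can exhaust a CPU, here is a minimal sketch of a naive backtracking matcher for the textbook pattern `(a+)+$`. The pattern is illustrative and is not the rule from the Cloudflare incident; the function name `count_attempts` is hypothetical. It counts how many group boundaries the matcher tries before giving an answer.

```python
def count_attempts(text: str) -> int:
    """Count group attempts a naive backtracking matcher makes for (a+)+$."""
    attempts = 0

    def match_groups(pos: int) -> bool:
        nonlocal attempts
        # Find how far a single (a+) group could extend from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every group length, longest first (greedy), then backtrack.
        for stop in range(end, pos, -1):
            attempts += 1
            if stop == len(text):   # `$` matches: all input consumed
                return True
            if match_groups(stop):  # try one more (a+) repetition
                return True
        return False

    match_groups(0)
    return attempts


if __name__ == "__main__":
    print(count_attempts("a" * 10))        # matching input: one attempt
    for n in (5, 10, 15):
        # Non-matching input: attempts roughly double per extra character.
        print(n, count_attempts("a" * n + "b"))
```

A matching input succeeds on the first attempt, while an almost-matching input of `n` letters followed by one bad character costs `2**n - 1` attempts in this model. Engines that avoid backtracking (automaton-based matchers, for example) trade some features for a guarantee of linear-time matching.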

Infrastructure + network

1 shipped

Post-mortem · 3 min read

BGP Withdrawal

A routine maintenance command on Facebook's backbone network triggered a BGP withdrawal of all of Facebook's prefixes — making facebook.com, instagram.com, whatsapp.com unreachable from the internet for 6 hours. Internal

BGP · DNS · ~6 hour outage