Post-mortems

Real failures.
Real root causes.

Real-world outages from AWS, Cloudflare, GitHub, GitLab, Meta, Slack, Roblox, Reddit, Datadog and others — each a canonical teaching case for a specific failure mode. Read before system design interviews to calibrate what actually goes wrong at scale.

12 post-mortems · 4 failure-mode clusters

Deploy / operator error

6 shipped
Post-mortem · 3 min read

S3 us-east-1

A single mistyped command, entered while debugging a billing system, took down a significant fraction of the internet, including Slack, Trello, Quora, Docker Hub, and AWS's own status page, for roughly 4 hours. The blast radius caught Amazon

Operator error · us-east-1 · ~4 hour outage
Post-mortem · 3 min read

Knight Capital

A stale deployment on 1 of 8 trading servers caused Knight's algorithm to submit millions of unintended orders in 45 minutes. Loss: $440 million. The firm was insolvent by end of day; absorbed by a competitor within week

Stale deploy · Trading · $440M loss
Post-mortem · 3 min read

Accidentally Deleted

An engineer, tired and debugging a replication issue at 11pm, ran rm -rf on the production primary database directory instead of the replica. Five of six backup methods were later discovered to have been silently failing

Operator error · Backups failing silently · Data loss
Post-mortem · 3 min read

Kernel Driver

A content update to CrowdStrike's Falcon kernel driver caused an out-of-bounds read that crashed Windows at boot. 8.5 million Windows hosts globally went into boot loops. Airlines grounded, hospitals diverted ambulances,

Kernel driver · 8.5M hosts · Global
Post-mortem · 2 min read

"Pi Day"

A planned Kubernetes control-plane upgrade introduced an incompatibility with a custom Calico (CNI) fork Reddit had been maintaining for years. Routing broke across pods. Reddit.com went fully dark for 5 hours — the long

Kubernetes · CNI fork · 5h outage
Post-mortem · 2 min read

systemd Update

A routine Ubuntu systemd security update, applied automatically to tens of thousands of Datadog VMs across all 5 regions, broke networking on every affected host. Datadog's observability service — the one customers trust

Auto-update · systemd · All regions

Coordination + consensus

3 shipped

Capacity + cascading failure

2 shipped

Infrastructure + network

1 shipped