Post-mortem · Deploy / operator error

S3 us-east-1

On February 28, 2017, a single mistyped command, entered while debugging a billing system, took down a significant fraction of the internet (Slack, Trello, Quora, Docker Hub, and AWS's own status page among them) for roughly 4 hours. The blast radius caught Amazon itself by surprise.

Operator error · us-east-1 · ~4 hour outage · Cascading failure

TL;DR

An S3 engineer running a playbook to debug slowness in the S3 billing system mistyped a command input and removed far more servers than intended, including servers that supported the S3 index subsystem and the placement subsystem. Both subsystems needed a full restart, which neither had undergone at that scale for years. The restart took hours, and dependent services (CloudFormation, Lambda, the status dashboard, and countless customer apps) failed along with S3.

Timeline

09:37 PST — Authorized S3 team member executes a capacity-reduction command intended to remove a small number of servers in the S3 billing subsystem. Typo in the input causes command to target a much wider fleet.
09:37 PST — Two independent critical subsystems become unavailable: (1) the index subsystem (handles metadata + location of all S3 objects), (2) the placement subsystem (manages allocation of storage for new objects).
09:37–11:00 PST — Engineers begin restarting the affected subsystems and discover that both take far longer to start than expected: neither had been fully restarted for many years, and the startup integrity checks run for hours.
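09:37–11:37 PST — AWS cannot update the Service Health Dashboard, because the dashboard's administration console itself depends on S3.
12:26 PST — Index subsystem has enough capacity back to begin serving GET, LIST, and DELETE requests.
13:18 PST — Index subsystem fully recovered; GET, LIST, and DELETE operate normally.
13:54 PST — Placement subsystem recovered, restoring PUT; S3 is operating normally. Dependent AWS services that built up backlogs during the outage need additional time to recover fully.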
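The restart-time surprise in the timeline is something a drill harness can surface ahead of an incident. The following is a minimal sketch, not AWS tooling: `timed`, `over_budget`, and `growth_ratio` are hypothetical helpers, and the budget is an assumed recovery-time objective chosen by the operator.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.monotonic()  # monotonic clock: immune to system clock changes
    try:
        yield
    finally:
        # Store the duration even if the restart procedure raised.
        results[label] = time.monotonic() - start


def over_budget(results, budget_seconds):
    """Return the labels whose measured duration exceeded the budget, sorted."""
    return sorted(
        label for label, secs in results.items() if secs > budget_seconds
    )


def growth_ratio(history):
    """Ratio of the latest measurement to the first one in a drill history."""
    if len(history) < 2 or history[0] <= 0:
        raise ValueError("need at least two measurements and a positive baseline")
    return history[-1] / history[0]
```

In a drill, the real cold-start procedure would run inside `with timed(results, "index-cold-start"):`, and the result would be appended to a per-subsystem history so that slow growth across years shows up as a rising `growth_ratio` long before it matters.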

Root cause

The trigger was a typo. The root cause was threefold: (a) a single command could remove an unbounded number of servers, (b) nothing confined that removal to one subsystem, so its blast radius reached critical subsystems, and (c) the system had never been tested end to end from a cold start, while restart times had grown over years of operation far beyond anyone's expectation.

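AWS's own post-mortem says as much, in paraphrase: the index and placement subsystems had not been completely restarted in the larger regions for many years, and after years of massive growth, restarting them and running the safety checks that validate metadata integrity took longer than expected.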
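Points (a) and (b) describe a tool with no upper bound on what one invocation can do. A bounded version might look like the sketch below. It is illustrative only: the `Subsystem` type, the `plan_removal` function, the per-action cap of 5, and the capacity floors are all assumptions made for the example, not details of AWS's actual tooling.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request violates a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # assumed capacity floor, not an AWS figure


def plan_removal(subsystem, requested, max_per_action=5):
    """Validate a removal request and return the servers left afterwards.

    Raises UnsafeRemovalError instead of acting when the request is too
    large for one action or would breach the subsystem's capacity floor.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive server count")
    # Guard 1: bound how much a single command can remove.
    if requested > max_per_action:
        raise UnsafeRemovalError(
            f"{requested} servers exceeds the per-action cap of {max_per_action}"
        )
    # Guard 2: never take a subsystem below its minimum required capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining}, "
            f"below its floor of {subsystem.min_required}"
        )
    return remaining
```

With both checks in place, a mistyped count fails loudly at validation time rather than silently widening the target set. The two guards are independent on purpose: the cap limits the damage of any one action, while the floor protects a subsystem that is already running close to its minimum.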

Blast radius

Massive. Because us-east-1 is disproportionately popular (it is AWS's oldest region, its prices are among the lowest, and many customers never bothered moving), a large share of the public internet was degraded or down for hours. Affected: Slack, Trello, Medium, Quora, Business Insider, GitHub Pages, Coursera, Docker Hub, Giphy. Many Lambda functions, whose code and state lived in S3, also failed. AWS's own status dashboard could not be updated because it too depended on S3. One widely cited third-party estimate put the cost to S&P 500 firms at about $150 million.

Lessons

Capacity-removal commands need guardrails. Post-incident, AWS changed the removal tool to remove capacity more slowly and added safeguards that block any removal that would take a subsystem below its minimum required capacity level.
Measure actual recovery time instead of estimating it. The startup time of the index subsystem had grown over years without anyone measuring it. Disaster-recovery drills that include a full cold start of critical services are the only way to know.
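Status pages must not depend on the thing being monitored. The AWS dashboard's dependency on S3 is the canonical example; AWS has since changed the dashboard's administration console to run across multiple regions. Host your status page on a provider that shares no failure domains with you.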
Multi-region is not optional for critical services. Customers single-homed in us-east-1 learned this the hard way. Spread read replicas and failover routes across regions.
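The status-page lesson above can be checked mechanically if service dependencies are recorded as a graph. The sketch below is one way to do that under stated assumptions: the graph format (a dict mapping each service to the services it depends on) and every service name in `EXAMPLE_GRAPH` are hypothetical, chosen only to mirror the shape of this incident.

```python
def transitive_deps(graph, start):
    """Return every node reachable from `start` by following dependency edges."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also protects against cycles
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def shared_failure_domains(graph, observer, monitored):
    """Return monitored services the observer depends on, directly or not."""
    return transitive_deps(graph, observer) & set(monitored)


# Hypothetical dependency graph: the status page reaches S3 only indirectly,
# through its administration console.
EXAMPLE_GRAPH = {
    "status-page": ["admin-console", "cdn"],
    "admin-console": ["s3:us-east-1"],
    "cdn": [],
}
```

A non-empty result from `shared_failure_domains(EXAMPLE_GRAPH, "status-page", ["s3:us-east-1", "ec2:us-east-1"])` means the observer can go dark together with something it is supposed to report on. Following edges transitively matters because the risky dependency is often several hops away from the page itself.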

Concepts in play

Blast radius minimization — the system lacked bounded-impact guarantees for tooling.
Incident response — status dashboards cannot share a failure domain with what they observe.
Disaster recovery — drills that test cold-start of critical subsystems.
Multi-region deployment — single-region dependencies are a ticking clock.
S3 architecture — context for the index + placement subsystems.