Post-mortem · Capacity + cascading failure

First-Day-Of-Year

On the first working day of 2021, everyone returning from holiday logged into Slack at once. The spike exposed a surprising limit on AWS Transit Gateway capacity. Retries from clients amplified the load into sustained saturation for hours.

Surge traffic · AWS TGW · Retry amplification · 5h outage

TL;DR

Slack's VPCs connect through AWS Transit Gateway (TGW). On Jan 4, 2021, traffic surged as users returned from holiday. TGW hit per-gateway bandwidth limits, and autoscaling could not add capacity fast enough to catch up. Failed requests caused clients to retry (and WebSocket clients to reconnect), which multiplied the load. The system stayed saturated for hours until Slack and AWS engineers together scaled the infrastructure past the surge.

Timeline

06:00 PST — US East Coast workers start logging into Slack after holiday. Traffic ~2× normal start-of-day.
07:00 PST — AWS Transit Gateway capacity in Slack's primary VPC saturates. Packets dropped. Slack's service-to-service traffic degrades.
07:15 PST — Slack engineers paged. Identify TGW saturation.
07:30 PST — Clients seeing errors begin aggressive retry cycles. Increased load compounds the saturation.
~09:00 PST — Slack + AWS engineers coordinate TGW scaling. Multiple scaling events required; autoscaling alone couldn't keep up.
~12:00 PST — Traffic back to nominal. Total outage ~5 hours of degraded service.

Root cause

A traffic spike from the holiday return, compounded by several contributing issues:

TGW per-gateway bandwidth limits. Documented, but higher than Slack's normal peak, so not routinely tested.
Autoscaling lag. TGW scaling events take minutes, not seconds; surge exceeded scaling rate.
Retry amplification. Clients timing out retried repeatedly. WebSocket clients reconnected. Each retry added more saturation pressure.
Latency-sensitive internal services cascaded. Authentication service depended on VPC-to-VPC calls. When those slowed, login storms intensified.
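
The interaction between a fixed capacity and retry amplification can be sketched with a toy model. This is an illustration only: the capacity, demand curve, and retry multipliers below are made-up numbers, not Slack's or AWS's figures, and `simulate` is a hypothetical helper. It assumes every failed request returns as a fixed number of retries on the next tick.

```python
def simulate(capacity, demand, retries_per_failure):
    """Toy discrete-time model of retry amplification.

    Each tick, offered load = new demand + retries carried over from the
    previous tick. Load above `capacity` fails, and every failed request
    comes back as `retries_per_failure` requests on the next tick.
    Returns the offered load per tick.
    """
    backlog = 0.0
    offered_history = []
    for new_requests in demand:
        offered = new_requests + backlog
        served = min(offered, capacity)
        failed = offered - served
        backlog = failed * retries_per_failure
        offered_history.append(offered)
    return offered_history


# Illustrative numbers: capacity 100, steady demand 80, a 3-tick surge to 150.
DEMAND = [80] * 2 + [150] * 3 + [80] * 15

no_retries = simulate(100, DEMAND, 0)    # load drops as soon as the surge ends
one_retry = simulate(100, DEMAND, 1)     # backlog drains through the 20 units of headroom
two_retries = simulate(100, DEMAND, 2)   # backlog grows each tick; no recovery without more capacity
```

In this model the surge itself lasts three ticks, but with two retries per failure the offered load keeps growing after demand returns to normal. Adding capacity or reducing retry pressure is the only way out, which mirrors the shape of the incident.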

Blast radius

~5 hours of degraded service across Slack's main geos. Users saw "connecting…" loops, messages undelivered, channel switches failing. Estimated ~10M paid users impacted during working hours.

Lessons

Load-test at the boundary limits of your cloud primitives. Normal autoscaling ≠ surge readiness. Know your peak; stress to 2×.
Retry storms need client-side jittered backoff with a hard cap. Otherwise retries can push load during any server-side outage to 2–5× baseline traffic.
Anticipate predictable surges. First day of the year, first Monday after a DST change, Sunday night before Monday standups. Scale up before the surge, not after.
Exercise capacity with chaos + synthetic load. Not just prod-like load tests; specifically test the path-through-TGW-under-saturation scenario.
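
A minimal sketch of the jittered-backoff lesson, assuming a "full jitter" scheme (a uniformly random delay between zero and an exponentially growing ceiling). The function names, default values, and the choice of retryable exceptions are illustrative assumptions, not Slack's client code. The hard cap is applied twice: on the delay (`cap`) and on the number of attempts (`max_attempts`).

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full-jitter delay in seconds: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(op, max_attempts=5, base=0.5, cap=30.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random):
    """Call `op` until it succeeds or `max_attempts` is reached.

    Only exceptions listed in `retry_on` are retried. After the final
    attempt the last exception is re-raised so the caller sees the failure
    instead of retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
```

The random delay spreads reconnects from many clients over time instead of synchronizing them, and the attempt limit bounds how much extra load one failing client can generate.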

Concepts in play

Autoscaling — scaling rate vs surge rate.
Retry + backoff — retry amplification is a classic pitfall.
Thundering herd — reconnect storms post-outage.
Backpressure — shedding load protects what can be saved.
Capacity planning — cloud limits must be modeled.
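
The backpressure concept can be illustrated with a small load shedder: a sketch that caps in-flight requests and rejects the excess quickly rather than queueing it. `LoadShedder`, `handle`, and the 503-style response are hypothetical names chosen for this example, and the limit is an assumed tuning value.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return True and take a slot if one is free, else return False."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Give back a slot taken by a successful try_acquire."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run `work` if a slot is free; otherwise fail fast with a 503-style reply."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early keeps the admitted requests fast, which protects the part of the system that can still be saved. It works best paired with capped, jittered client retries so the rejections do not come straight back as new load.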