Post-mortem · Infrastructure + network

BGP Withdrawal

On October 4, 2021, a routine maintenance command on Facebook's backbone network triggered a BGP withdrawal of Facebook's prefixes, making facebook.com, instagram.com, and whatsapp.com unreachable from the internet for about 6 hours. Internal tools depended on the same DNS, so engineers were locked out of the systems they needed to fix the outage.

BGP · DNS · ~6 hour outage · Meta
01

TL;DR

An engineer runs a command that, due to a bug, takes down the entire backbone network instead of auditing it. With the backbone down, Facebook's authoritative DNS servers can no longer reach the data centers. Treating that as a sign that they themselves are unhealthy, they stop announcing their own prefixes over BGP. With no route to its DNS servers, Facebook effectively stops existing on the internet. Because internal tools, door badges, and phone systems depended on the same name resolution, engineers couldn't log in remotely and had trouble getting physical access to the data centers to diagnose the problem.
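
The withdrawal behavior described above can be illustrated with a toy model. This is a minimal sketch, not Facebook's actual implementation: the `DnsNode` class, the node names, and the single `backbone_up` flag are all assumptions made for illustration. It shows why a health check that is sensible for one failing node becomes harmful when every node fails the same check at once.

```python
from dataclasses import dataclass


@dataclass
class DnsNode:
    """Hypothetical DNS node that announces a shared prefix over BGP."""
    name: str
    announcing: bool = True

    def health_check(self, backbone_up: bool) -> None:
        # Assumed rule: a node that cannot reach the backbone considers
        # itself unhealthy and withdraws its own announcement.
        self.announcing = backbone_up


def reachable_from_internet(nodes: list[DnsNode]) -> bool:
    # The prefix stays routable while at least one node still announces it.
    return any(node.announcing for node in nodes)


nodes = [DnsNode("edge-a"), DnsNode("edge-b"), DnsNode("edge-c")]

# Local failure: one node loses the backbone, the others keep announcing.
nodes[0].health_check(backbone_up=False)
one_down = reachable_from_internet(nodes)

# Global failure: every node loses the backbone and withdraws at once.
for node in nodes:
    node.health_check(backbone_up=False)
all_down = reachable_from_internet(nodes)

print(one_down, all_down)
```

In this model the per-node rule is harmless for a single failure and fatal for a correlated one, which is the pattern the outage followed.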

02

Timeline

  • 15:39 UTC — Engineer runs a command during backbone capacity assessment. Bug in audit tool causes all backbone connections to drop.
  • 15:40 UTC — Facebook's authoritative DNS servers lose contact with the data centers. They stop BGP-announcing the prefixes they serve.
  • 15:40 UTC — Lookups for facebook.com, instagram.com, whatsapp.com return SERVFAIL or time out worldwide.
  • 15:40 UTC — Internal engineering tools, Workplace, company Slack, VPN, badge systems (which depend on internal DNS) become inaccessible.
  • ~17:00 UTC — Engineers physically travel to data centers. On-site access to routers requires physical keys, because biometric readers and electronic doors also depend on the internal network.
  • 21:05 UTC — Network restored; services begin returning. A thundering herd of 3B users reconnecting extends the tail of the outage.
  • 21:28 UTC — Most services operational again.
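
The thundering-herd effect in the timeline is commonly softened on the client side with exponential backoff plus jitter. The following is a minimal sketch of the "full jitter" variant. It is not taken from the incident report: the function name `reconnect_delays` and the `base` and `cap` values are illustrative assumptions.

```python
import random


def reconnect_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return one randomized wait (in seconds) per reconnect attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles on each attempt, limited by `cap`.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] so that clients
        # which failed at the same moment do not retry at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator is used here only to make the example repeatable.
delays = reconnect_delays(6, rng=random.Random(42))
print([round(d, 2) for d in delays])
```

Spreading retries over a growing random window trades a slightly slower reconnect for a much flatter load curve on freshly restored servers.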

03

Root cause

Three cascading failures:

  1. Audit tool bug. A command intended to assess backbone capacity had a bug that caused it to shut down every backbone link. The policy system that should have rejected this command because of its blast radius also had a bug, so it let the command through.
  2. DNS tightly coupled to backbone. Facebook's authoritative DNS servers were deep inside Facebook's own network. When the backbone died, DNS died with it, which made BGP retract the routes to those DNS servers, which made Facebook unroutable.
  3. Everything-internal-uses-internal-DNS. Company tools, badge readers, VPN, physical access systems — all resolved via the same DNS that was now gone.
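
The circular dependency in these three failures can be made concrete with a small dependency graph. This is a sketch under stated assumptions: the component names and edges below are a simplified, hypothetical model of the coupling described above, not an inventory of Facebook's real systems. The check asks which recovery paths transitively depend on the component that failed.

```python
# Each key depends on every component in its list (hypothetical model).
deps = {
    "backbone": [],
    "internal_dns": ["backbone"],
    "vpn": ["internal_dns"],
    "badge_readers": ["internal_dns"],
    "remote_router_access": ["vpn"],
    "onsite_router_access": ["badge_readers"],
    "oob_console": [],  # assumed out-of-band path with no shared dependency
}


def depends_on(component, target, graph, seen=None):
    """Return True if `component` transitively depends on `target`."""
    seen = set() if seen is None else seen
    if component in seen:
        return False  # already explored, avoids looping on cycles
    seen.add(component)
    return any(
        dep == target or depends_on(dep, target, graph, seen)
        for dep in graph.get(component, [])
    )


recovery_paths = ["remote_router_access", "onsite_router_access", "oob_console"]
blocked = [p for p in recovery_paths if depends_on(p, "backbone", deps)]
usable = [p for p in recovery_paths if p not in blocked]
print("blocked:", blocked)
print("usable:", usable)
```

Running a check like this against a real dependency inventory is one way to find recovery tools that share a failure domain with the system they are meant to recover.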

04

Blast radius

Facebook, Instagram, WhatsApp, Messenger, Oculus, and Workplace were fully unavailable for ~6 hours, globally, affecting ~3B users. Small businesses that used WhatsApp as their primary messaging channel couldn't operate. Market cap loss: $7B intraday. Internal impact: engineers were unable to work, and some had to physically cut server-room locks.

05

Lessons

  1. Out-of-band access is non-negotiable. Critical-recovery tools must not share failure domains with the thing they recover. Separate admin network, cellular-backup badge readers, printed runbooks.
  2. Audit-tool guardrails. Tools that can affect backbone-sized blast radius need a second approval + automated blast-radius check before execution.
  3. Don't route DNS inside the service you're announcing. If your DNS is behind your BGP, you have a circular failure dependency. External DNS (or at minimum a secondary DNS in a totally separate network) is cheap insurance.
  4. Test disconnect scenarios. "What happens if we lose the internal network?" is a scenario that should be drilled, not discovered.
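
The guardrail in lesson 2 can be sketched as a pre-execution check. This is a hypothetical illustration, not the policy system involved in the outage: the function name `approve_change`, the 5% threshold, and the two-approval rule are assumptions. The property worth noting is that it fails closed, so a change whose impact cannot be computed is rejected rather than waved through.

```python
def approve_change(affected_links, total_links, approvals,
                   max_fraction=0.05, required_approvals=2):
    """Return (approved, reason) for a proposed backbone change."""
    # Fail closed: if the impact could not be computed, do not guess.
    if affected_links is None or total_links <= 0:
        return False, "impact unknown"
    fraction = affected_links / total_links
    # Automated blast-radius check runs before any human approval counts.
    if fraction > max_fraction:
        return False, f"touches {fraction:.0%} of links, limit is {max_fraction:.0%}"
    if approvals < required_approvals:
        return False, f"needs {required_approvals} approvals, has {approvals}"
    return True, "ok"


whole_backbone = approve_change(100, 100, approvals=2)
unknown_impact = approve_change(None, 100, approvals=2)
small_unreviewed = approve_change(2, 100, approvals=1)
small_reviewed = approve_change(2, 100, approvals=2)
print(whole_backbone, unknown_impact, small_unreviewed, small_reviewed)
```

Keeping the blast-radius limit independent of the approval count means that no number of sign-offs can authorize a change that touches the whole backbone in one step.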

06

Concepts in play

  • DNS — the single point of resolution whose loss took everything down.
  • Blast radius minimization — audit tools with unbounded impact.
  • Disaster recovery — out-of-band recovery paths must exist and be tested.
  • Service mesh — tight coupling across a global network can become a single failure domain.
  • Multi-region — even geographic distribution doesn't help when the control plane is centralized.