A single regular expression in a Cloudflare WAF rule consumed all CPU on every Cloudflare machine worldwide within seconds. 502s for everyone Cloudflare fronted — ~10% of the internet — for 27 minutes.
Regex backtrackingGlobal27 minutesWAF
01
TL;DR
Cloudflare deployed a new WAF managed rule containing a regex with catastrophic backtracking. Applied to every HTTP request at every PoP. CPU on every machine globally went to 100% within seconds; every request returned 502. Rollback took 27 minutes because the deploy system itself was impacted.
02
Timeline
13:42 UTC — WAF managed-rules update deployed globally. Contains a new rule with regex: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|\{\}|\|\||\+)*.*(?:.*=.*)))
13:42 UTC — CPU usage across the entire fleet begins climbing. Within 20 seconds, most machines pinned at 100%.
13:45 UTC — Internal alerts fire. SRE team engages. Identifies WAF process as CPU consumer.
13:47 UTC — Cloudflare's own dashboard + API inaccessible because they proxy through the same infrastructure.
13:57 UTC — Kill switch (a pre-built capability to disable the WAF globally) engaged. CPU drops; requests flow again.
14:09 UTC — Normal service restored after deploy-system catch-up.
03
Root cause
The regex exhibited exponential backtracking on specific input patterns. With the input "x=x", the engine explored all possible ways to match the alternations; each additional character doubled the work. A 30-character crafted-like input would run for hours.
The rule was tested on the rule-writer's laptop (fast enough to not notice), then deployed to production via the normal managed-rules pipeline. No test fixture caught the pathological case; no CI CPU-time ceiling existed for WAF rules.
Deeper root cause: PCRE/re2 library choice. PCRE-style backtracking engines are vulnerable to ReDoS; Go's re2 (and Rust's regex crate) guarantee linear-time matching. Cloudflare was using PCRE.
04
Blast radius
Every single HTTP request going through Cloudflare returned 502 for ~27 minutes. That's an estimated ~10% of the web at the time. Discord, Medium, Coinbase, DigitalOcean, Postmates, countless others dark. Financial impact to Cloudflare customers collectively: hundreds of millions of dollars in lost business. Cloudflare's own dashboard inaccessible during the incident, slowing response.
05
Lessons
Regex engine choice has massive security implications. PCRE/Perl/Java/JavaScript engines can backtrack exponentially. re2/Rust's engine is guaranteed linear-time. If the regex is untrusted (user input OR rules written by many authors), use a linear-time engine.
Kill switches are non-negotiable for global services. Cloudflare had one, which is why recovery was 27 minutes, not hours. Every critical subsystem should answer "what's the one-click escape hatch?"
Canary deploys for WAF rule changes. Deploy to 1% of PoPs for N minutes; watch CPU; promote. Cloudflare added this post-incident; the managed-rules pipeline now canary-staged.
CPU budget per request. WAF rules that consume >X ms get killed and logged. Preventive bound instead of reactive kill switch.