Post-mortem · Deploy / operator error

Kernel Driver

A content update to CrowdStrike's Falcon kernel driver caused an out-of-bounds read that crashed Windows at boot. About 8.5 million Windows hosts worldwide went into boot loops. Airlines grounded flights, hospitals diverted ambulances, and banks and retailers shut down. Each affected machine required manual, hands-on recovery in safe mode.

Kernel driver · 8.5M hosts · Global · Manual recovery
01

TL;DR

CrowdStrike Falcon ships a kernel-mode driver on every managed Windows host. On July 19, 2024, a routine "channel file" content update (not a full driver release, but a signature-file update of the kind pushed several times a day) contained a malformed template that caused the driver to read past a buffer, crashing the Windows kernel on boot. Because the driver loads early in boot, rebooting didn't help: the host crashed again before it could fetch a fix. Every affected machine needed manual intervention: boot to safe mode, delete the specific file, reboot. Estimated damages: $10B+.

02

Timeline

  • 04:09 UTC — CrowdStrike deploys channel file 291 (a routine update) to all customers.
  • 04:09–05:30 UTC — Millions of Windows hosts worldwide begin bluescreening at boot. Global services dependent on those hosts — airline check-in, hospital EHRs, TV stations, banks — go offline.
  • 05:27 UTC — CrowdStrike pushes a corrected channel file, but hosts that already crashed cannot receive it (they're not booting).
  • 07:00 UTC — CrowdStrike publishes workaround: boot into safe mode, delete C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys, reboot.
  • Jul 19–25 — Individual hosts recovered manually by IT teams worldwide. Many were BitLocker-encrypted, requiring recovery keys per machine.
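
The file-deletion step of the workaround can be sketched in a few lines. This is only an illustration of the step, assuming a Python interpreter is available on the recovered host (in practice most teams did this by hand or with a shell command in safe mode). The function name and the dry_run flag are hypothetical; the directory and file pattern are the ones from the published workaround.

```python
from pathlib import Path

# Directory and pattern taken from the published workaround.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_FILE_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(driver_dir: Path = CROWDSTRIKE_DIR,
                             pattern: str = BAD_FILE_PATTERN,
                             dry_run: bool = True) -> list[Path]:
    """Return the files matching the pattern, deleting them unless dry_run is set."""
    matches = sorted(driver_dir.glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Defaulting to a dry run lets an operator see which files would be removed before anything is deleted. Other channel files in the same directory do not match the pattern and are left alone.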
03

Root cause

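The template type carried in channel file 291 defined 21 input fields, but the sensor code supplied only 20 input values to the driver's content interpreter. Earlier template instances used a wildcard for the 21st field, so the mismatch stayed latent. The new instance required a real match on the 21st field, and with no runtime bounds check the interpreter read past the end of the 20-element input array. In kernel mode, an invalid memory read like this crashes the OS immediately.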
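
The mismatch can be modeled in a few lines. This is a simplified sketch, not CrowdStrike's code: it assumes the template is a list of 21 match criteria, the sensor supplies 20 input values, and a wildcard criterion never reads its input. The names matches, WILDCARD, old_template and new_template are hypothetical.

```python
from typing import Optional, Sequence

WILDCARD = None  # hypothetical marker: "match anything, do not read the input"


def matches(criteria: Sequence[Optional[str]], inputs: Sequence[str]) -> bool:
    """Compare each non-wildcard criterion against the input at the same index.

    Rejects the template instead of indexing past the end of `inputs`.
    """
    for index, expected in enumerate(criteria):
        if expected is WILDCARD:
            continue  # wildcard fields never touch the input array
        if index >= len(inputs):
            raise ValueError(
                f"criterion {index + 1} has no input (only {len(inputs)} supplied)"
            )
        if inputs[index] != expected:
            return False
    return True


inputs = [f"value{i}" for i in range(20)]      # the sensor supplies 20 values
old_template = [WILDCARD] * 21                 # 21st field is a wildcard
new_template = [WILDCARD] * 20 + ["value20"]   # 21st field requires a real match
```

Here matches(old_template, inputs) returns True because the 21st field is never read, while matches(new_template, inputs) raises ValueError at the bounds check. Python would raise an IndexError even without the explicit check; in C, the same unchecked index reads whatever memory follows the array, which is what brings down a kernel driver.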

Deeper causes:

  1. No canary for content updates. Full driver binaries went through staged rollouts, but channel files were treated as data rather than code and deployed to 100% of customers simultaneously. (Channel files are effectively code: they determine what the kernel-mode interpreter does.)
  2. No in-process content validator. The driver assumed channel files were well-formed and did not bounds-check the input array at runtime.
  3. Kernel mode with no recovery path. A user-mode component can crash, restart, and skip a bad update. A kernel driver cannot.
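
Points 2 and 3 suggest a validate-then-activate pattern with a fallback to the previous content. The sketch below is a minimal illustration under assumptions: ContentStore and its methods are hypothetical names, the validator is supplied by the caller, and the logic would have to run somewhere that can survive a bad file (for example in user mode, before the kernel component consumes the content).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContentStore:
    """Tracks the active content and the version that was active before it."""
    active: Optional[bytes] = None
    last_known_good: Optional[bytes] = None

    def apply(self, candidate: bytes, validate: Callable[[bytes], bool]) -> bool:
        """Activate candidate only if it validates; otherwise keep current content."""
        try:
            ok = validate(candidate)
        except Exception:
            ok = False  # a validator crash counts as a rejection
        if not ok:
            return False
        self.last_known_good = self.active
        self.active = candidate
        return True

    def rollback(self) -> None:
        """Revert to the previous content, e.g. after a failed health check."""
        self.active = self.last_known_good
```

The key design choice is that a rejected or crashing update leaves the host running on content that already worked, so a bad file costs one skipped update rather than a boot loop.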
04

Blast radius

~8.5 million Windows hosts. Delta Air Lines alone: about 7,000 cancelled flights and a $500M loss. Healthcare: surgeries postponed and ambulances diverted. Payments: Stripe and others saw partial disruption. Media: Sky News went off air, and US local TV stations were unable to broadcast. Insurers put the estimated global economic impact at $10B+. It was one of the largest technology-induced outages in history, behind only telco-network failures by count of affected users.

05

Lessons

  1. The "content" vs. "code" distinction is invalid if content drives execution. Anything that changes runtime behavior needs the rigor of code: tests, canaries, rollback plans.
  2. Kernel modules are the last place for risky deploys. Every kernel-mode component needs two recovery paths: a safe-mode boot that skips the module, and a user-mode equivalent where possible.
  3. Progressive rollouts, always. 1% → 10% → 100% with bake time and metrics gates. Even "small" updates. CrowdStrike has since adopted staged deploys for all channel files.
  4. Don't design out remote recovery. The combination of BitLocker, hosts that could not boot, and a physical-access requirement meant recovery couldn't be automated centrally. With a simple "safe-mode boot + pull latest config" recovery path, recovery would have taken hours instead of days.
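
The progressive rollout in lesson 3 can be sketched as a loop over stages with a health gate after each one. This is an illustration only: staged_rollout, deploy and healthy are hypothetical names, the health check stands in for real crash and check-in metrics, and bake time between stages is omitted for brevity.

```python
from typing import Callable, Sequence

STAGES = (0.01, 0.10, 1.00)  # 1% -> 10% -> 100%, as in lesson 3


def staged_rollout(hosts: Sequence[str],
                   deploy: Callable[[str], None],
                   healthy: Callable[[Sequence[str]], bool],
                   stages: Sequence[float] = STAGES) -> int:
    """Deploy stage by stage and stop at the first failed health gate.

    Returns the number of hosts that received the update.
    """
    done = 0
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction)) if hosts else 0
        for host in hosts[done:target]:
            deploy(host)
        done = max(done, target)
        if not healthy(hosts[:done]):
            break  # halt: the remaining hosts never receive the update
    return done
```

With 1,000 hosts and a gate that fails after the first stage, only 10 hosts receive the bad update. Applied to a fleet of millions, that is the difference between a contained incident and a global one.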
06

Concepts in play