Post-mortem · Deploy / operator error

systemd Update

A routine Ubuntu systemd security update, applied automatically to tens of thousands of Datadog VMs across all 5 regions, broke networking on every affected host. Datadog's observability service — the one customers trust to tell them when they're down — went dark for 24 hours.

Auto-updatesystemdAll regions24h outage
01

TL;DR

Datadog's VMs had Ubuntu unattended-upgrades enabled. An Ubuntu security patch to systemd-networkd shipped; VMs auto-updated; the update restarted networking on each VM in a way that broke CNI config for the running K8s pods. Every region affected simultaneously because unattended-upgrades runs on a cron schedule that doesn't vary by region. Datadog, whose entire business is "observability you can trust," went dark — customers couldn't see their own systems' metrics during the outage.

02

Timeline

  • Pre-event — Datadog VMs run Ubuntu with unattended-upgrades enabled — standard security best practice.
  • ~06:00 UTC Mar 8 — Ubuntu pushes systemd-networkd security patch.
  • ~06:00 UTC — Datadog VMs across 5 regions begin applying the patch as cron runs unattended-upgrades.
  • 06:00–06:30 — systemd-networkd restart on each VM removes K8s CNI config. Pods lose networking. Datadog internal services cascade.
  • 06:30–08:00 — Datadog engineers realize scale of impact. Disable unattended-upgrades everywhere. Build + ship a fix that re-programs CNI.
  • 08:00–24:00 — Gradual region-by-region recovery. Service fully recovered ~24 hours later.
03

Root cause

The specific bug: Ubuntu's systemd patch triggered a "reload" of networking that doesn't actually cooperate with Kubernetes CNI plugins. CNI sets up virtual interfaces + iptables rules at pod-creation time. systemd-networkd reload cleared those rules. Pods now had interfaces but no routing.

The deeper cause: unattended-upgrades was running across ALL VMs simultaneously. Multi-region redundancy didn't help because the update landed everywhere at the same hour. "Multi-region" was a geography strategy, not a time-schedule strategy.

04

Blast radius

24 hours of Datadog unavailability, globally. Customers operating critical infrastructure couldn't see their own metrics, alerts, logs, or APM traces. Estimated ~18,000 paid customers impacted. Service credits + customer trust hits. The irony — a monitoring company blind during its own outage — led to significant internal process changes.

05

Lessons

  1. "Multi-region" must include "multi-time." Regional redundancy is useless against a synchronized global cause. Stagger auto-updates by region.
  2. Auto-updates on production are a double-edged sword. Security team loves them. Infra team hates them. The real answer: auto-updates with staged rollouts + the ability to pause if something breaks.
  3. Monitoring systems need special care. A monitoring platform is mission-critical to customers by definition. Its own reliability bar has to exceed the bar of the systems it monitors.
  4. Treat the OS as a dependency. OS + distribution + kernel + init system is part of your stack. Upgrades are code changes. Canary + monitor them.
06

Concepts in play