Post-mortem · Deploy / operator error

"Pi Day"

A planned Kubernetes control-plane upgrade introduced an incompatibility with a custom Calico (CNI) fork Reddit had been maintaining for years. Routing broke across pods. Reddit.com went fully dark for 5 hours — the longest outage in Reddit's history to date.

KubernetesCNI fork5h outageTech debt

TL;DR

Reddit upgraded production Kubernetes from 1.23 to 1.24 on Pi Day. The upgrade changed how node labels were set; Reddit's custom Calico fork relied on the old labeling convention. New pods failed to receive correct IP routes. Existing pods kept working but could not be replaced or scaled. The team rolled back — but rolling back K8s in-place is a multi-hour procedure; in the meantime most service replicas aged out and Reddit.com went dark.

Timeline

19:25 UTC — K8s upgrade 1.23 → 1.24 begins on primary cluster.
19:35 UTC — Upgrade completes. Existing pods still running; everything looks fine.
19:45 UTC — First signs of trouble: new pods failing readiness checks. Pod-to-pod traffic failing. Sporadic page load failures.
20:00 UTC — Reddit.com becoming inconsistent. Some users seeing errors, some not, depending on which pods they route to.
20:20 UTC — Scope of problem clear: CNI routing broken; all new pods unusable. Decision to roll back.
20:30–00:30 UTC — Rolling back K8s control plane. Rolling back node pool. Calico config restored to pre-upgrade state. Pods recovering gradually.
00:30 UTC Mar 15 — Reddit fully operational. Total ~5-hour outage.

Root cause

Reddit had forked Calico (a K8s CNI plugin) years prior to add custom behavior. The fork was never upstreamed. Over time, the team made small fixes to keep it working against newer K8s; by K8s 1.24 the drift between upstream Calico and Reddit's fork was sufficient that the new K8s behavior broke the fork.

Specifically: K8s 1.24 changed how node labels set on kubelet startup. Reddit's Calico fork read labels at init-time with the old convention. New nodes came up with labels the fork couldn't understand. Routing rules never got programmed.

The upgrade had been tested in staging — but staging's Calico version had been accidentally updated to a cleaner state than production's, so the bug didn't reproduce.

Blast radius

Reddit.com fully inaccessible for 5 hours. ~50M daily active users affected. During downtime, Reddit's competitor subreddits on Discord / Twitter filled with memes about the outage. Ad revenue impact ~$1M+. Increased pressure on Reddit's reliability narrative going into their IPO year.

Lessons

Long-lived forks of critical infrastructure are unpaid tech debt. Every fork that doesn't upstream becomes harder to upgrade each release. Either upstream, or adopt a supported config option, or accept the fork as a maintenance cost.
Staging must match production, version-by-version. The bug was in production's stale Calico. Staging was cleaner. This is the classic "works on staging" failure mode.
K8s upgrades are a risk surface. Draining + upgrading the control plane itself affects all workloads. Run canary clusters first; never upgrade prod and staging simultaneously.
Plan rollback before the upgrade. Rolling back K8s is harder than rolling forward. Having a practiced, fast rollback path is necessary.

Concepts in play

Kubernetes — upgrade risk surface.
Canary + rollback — staged rollouts for infrastructure changes.
Feature flags — an abstraction that sometimes even applies to infra upgrades.
Immutable infrastructure — forks resist upgrades.
Testing in production — staging must match prod.