Post-mortem · Coordination + consensus

Consul + HAProxy

A brief network blip between Slack's service registry (Consul) and its load balancers (HAProxy) caused a metastable failure: even after the network recovered, the system stayed broken because recovery generated more work than normal operation. Slack was degraded for more than 4 hours.

Metastable failure · Consul · HAProxy · 4h outage
01

TL;DR

A transient network issue caused Consul nodes to flip HAProxy backends rapidly between unhealthy and healthy. HAProxy's config-reload-on-membership-change fired many thousands of times. Each reload forced active connections to be re-established, which generated more health-check traffic, which caused more churn. Even after the network stabilized, the system could not catch up, because the work of recovering exceeded the work of steady-state operation. Engineers had to dampen the churn manually by holding off HAProxy reloads.

02

Timeline

  • 06:00 PST — Transient network flap between AZs causes Consul health checks to miss ~20% of instances for a few seconds.
  • 06:00–06:05 — Consul marks thousands of backends unhealthy, then re-healthy as checks resume. Each state change triggers HAProxy config reload.
  • 06:05–10:00 — HAProxy reloads at ~30/sec. Each reload cycles active connections. Slack clients reconnect; new connections trigger health checks; more churn. System does not return to steady state even after the initial network issue resolves.
  • 06:15 PST — Engineers engage. Identify the reload storm + reconnect cascade.
  • ~09:00 PST — Engineers introduce manual dampening: hold HAProxy from reloading for N seconds after membership change; tolerate temporary stale membership.
  • ~10:30 PST — Traffic recovers as feedback loop breaks.
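
The dampening step in the timeline (hold reloads for N seconds after a membership change and tolerate stale membership) can be sketched as a small hold-down timer. This is a minimal sketch in Python, not Slack's actual tooling: `ReloadDampener`, its `reload` callback, and the `hold_seconds` value are hypothetical names chosen for illustration, and the current time is passed in explicitly so the behaviour is easy to check.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class ReloadDampener:
    """Coalesce bursts of membership changes into one reload per hold window."""

    # Hypothetical hook that would rewrite the config and reload the proxy.
    reload: Callable[[Set[str]], None]
    # The "N seconds" from the timeline; the real value is not given in the source.
    hold_seconds: float
    _pending: Optional[Set[str]] = field(default=None, init=False)
    _deadline: float = field(default=0.0, init=False)

    def on_membership_change(self, backends: Set[str], now: float) -> None:
        # The first change in a burst starts the hold timer.
        if self._pending is None:
            self._deadline = now + self.hold_seconds
        # Later changes only overwrite the remembered membership.
        self._pending = set(backends)

    def tick(self, now: float) -> bool:
        # Call periodically; performs at most one reload per hold window.
        if self._pending is None or now < self._deadline:
            return False
        backends, self._pending = self._pending, None
        self.reload(backends)
        return True
```

Fed 1,000 flaps within one second, this sketch performs a single reload with the final membership once the hold window ends. The trade-off is the one the timeline names: membership can be stale for up to `hold_seconds`.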
03

Root cause

This is the classic metastable failure pattern. The system had two stable states: (A) healthy operation with low, steady background churn, and (B) a death spiral in which reconnect-driven churn sustains itself. A minor perturbation pushed the system across the boundary from (A) to (B), and it stayed in (B) even after the perturbation was gone.

Specifically, each HAProxy reload kills in-flight connections, so clients reconnect. The reconnects trigger new Consul health checks, and that health-check traffic adds load. At a high reload rate, health-check latency rises, which causes more "unhealthy" flaps and therefore more reloads. This is a positive feedback loop.
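
To make the two-state behaviour concrete, here is a toy discrete-time model of the loop in Python. It is only a sketch: every constant (backend count, capacity, cost per reload, size of the blip) is an illustrative assumption, not a measurement from the incident, and `simulate` and `reload_cap` are hypothetical names.

```python
from typing import List, Optional


def simulate(ticks: int, blip: range, reload_cap: Optional[int] = None) -> List[int]:
    """Return reloads per tick for a toy reload / health-check feedback loop."""
    backends = 1000        # backends being health-checked (assumed)
    baseline_flaps = 2     # flaps per tick in normal operation (assumed)
    blip_flaps = 200       # extra flaps per tick while the network blip lasts (assumed)
    base_load = 50.0       # health-check load with no reloads (assumed)
    cost_per_reload = 1.0  # extra check load caused by one reload's reconnects (assumed)
    capacity = 100.0       # load above which checks start to look unhealthy (assumed)

    reloads = baseline_flaps
    history: List[int] = []
    for t in range(ticks):
        load = base_load + cost_per_reload * reloads
        # Fraction of backends that flap because checks are too slow under overload.
        slow_fraction = min(1.0, max(0.0, (load - capacity) / capacity))
        flaps = baseline_flaps + int(backends * slow_fraction)
        if t in blip:
            flaps += blip_flaps
        # One reload per flap, unless dampening caps the reload rate.
        reloads = flaps if reload_cap is None else min(flaps, reload_cap)
        history.append(reloads)
    return history


undamped = simulate(20, blip=range(5, 8))
damped = simulate(20, blip=range(5, 8), reload_cap=20)
```

With these assumed numbers, the undamped run sits at 2 reloads per tick before the blip and settles at 1002 after it, long after the blip has ended (state B). The capped run returns to 2 reloads per tick as soon as the blip ends (state A), because the cap keeps check load below capacity and the loop never closes.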

04

Blast radius

About 4 hours of degraded Slack service globally. Messages were delayed and some were dropped, channel switches returned errors, and file uploads failed. ~10M business users were affected. The outage hurt revenue and customer trust, and Slack issued credits under its SLA terms.

05

Lessons

  1. Rate-limit membership-change reactions. Don't let a stream of "backend flapped" events trigger a stream of reloads. Batch and dampen them.
  2. Reconnection storms are a separate threat model. Budget for them; ideally cap reconnect rate client-side.
  3. Metastable failures are real and specific. They can often be mitigated by shedding load: if the system is overwhelmed, drop some traffic aggressively instead of struggling to serve all of it.
  4. Health checks should be cheap. If health checking at rate X causes load Y that itself affects health, the check is too expensive.
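
Lessons 2 and 3 can be illustrated with two small building blocks in Python: capped exponential backoff with full jitter for client reconnects, and a simple in-flight limit for shedding load. This is a hedged sketch rather than anything taken from Slack's clients or servers; `reconnect_delay`, `LoadShedder`, and all default values are hypothetical.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: random.Random = random.Random()) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based).

    Full jitter: pick uniformly in [0, min(cap, base * 2**attempt)] so that
    clients dropped by the same reload do not all reconnect at once.
    """
    exponent = min(attempt, 30)  # bound the exponent so the ceiling stays finite
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


class LoadShedder:
    """Reject new work once in-flight work reaches a fixed limit."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self) -> bool:
        # Fail fast when full instead of queueing more work behind a slow system.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def done(self) -> None:
        # Call when an admitted request finishes, whether it succeeded or not.
        self.in_flight = max(0, self.in_flight - 1)
```

A client would sleep for `reconnect_delay(attempt)` before each retry and reset `attempt` after a successful connection. A server would call `try_admit()` per request and return an error immediately when it yields `False`. Both choices trade a little latency or a few rejected requests for keeping total work below the level at which the feedback loop sustains itself.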
06

Concepts in play