Post-mortem · Coordination + consensus

Consul + HAProxy

A brief network blip between Slack's service registry (Consul) and its load balancers (HAProxy) caused a metastable failure: even after the network recovered, the system stayed broken because recovering generated more work than normal operation. 4+ hours of degraded Slack.

Metastable failure · Consul · HAProxy · 4h outage

TL;DR

A transient network issue caused Consul to flap HAProxy backends rapidly between unhealthy and healthy. HAProxy's config-reload-on-membership-change fired many thousands of times. Each reload forced active connections to be re-established, which generated more health-check traffic, which caused more churn. Even after the network stabilized, the system couldn't catch up because the work of recovering exceeded the work of steady-state operation. Engineers had to manually dampen health-check churn.

Timeline

06:00 PST — Transient network flap between AZs causes Consul health checks to miss ~20% of instances for a few seconds.
06:00–06:05 — Consul marks thousands of backends unhealthy, then re-healthy as checks resume. Each state change triggers HAProxy config reload.
06:05–10:00 — HAProxy reloads at ~30/sec. Each reload cycles active connections. Slack clients reconnect; new connections trigger health checks; more churn. System does not return to steady state even after the initial network issue resolves.
06:15 PST — Engineers engage. Identify the reload storm + reconnect cascade.
~09:00 PST — Engineers introduce manual dampening: hold off HAProxy reloads for N seconds after a membership change, tolerating temporarily stale membership.
~10:30 PST — Traffic recovers as feedback loop breaks.
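
The manual dampening step in the timeline can be sketched as a hold-down timer that coalesces membership changes into one reload. This is a minimal illustration, not the fix Slack actually shipped: the class name ReloadDampener, its methods, and the integer time units are all hypothetical, and the real "N seconds" value is not given in the source.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ReloadDampener:
    """Hypothetical hold-down: at most one reload per `hold` time units of churn."""
    hold: int                                  # stands in for the "N seconds" hold; value is an assumption
    pending: Set[str] = field(default_factory=set)
    first_change_at: Optional[int] = None      # when the current batch of changes started
    reloads: int = 0

    def on_membership_change(self, backend: str, now: int) -> None:
        # Record the change only; membership is allowed to be stale during the hold.
        if self.first_change_at is None:
            self.first_change_at = now
        self.pending.add(backend)

    def tick(self, now: int) -> bool:
        """Reload once the hold has elapsed since the first pending change."""
        if self.first_change_at is None or now - self.first_change_at < self.hold:
            return False
        self.pending.clear()                   # one reload applies every batched change
        self.first_change_at = None
        self.reloads += 1
        return True


def count_reloads(num_changes: int, hold: int) -> int:
    """Feed one membership change per time unit and count the resulting reloads."""
    dampener = ReloadDampener(hold=hold)
    for t in range(num_changes):
        dampener.on_membership_change(f"backend-{t % 50}", now=t)
        dampener.tick(now=t)
    return dampener.reloads
```

In this sketch, 600 back-to-back changes with a hold of 100 units collapse into 5 reloads, while a hold of 0 reloads on every change. The cost is the one named in the timeline: membership can be stale for up to the hold duration.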

Root cause

Classic metastable failure pattern. The system had two stable states: (A) healthy operation with normal steady-state churn, and (B) a death spiral in which churn from reconnects sustains itself. A minor perturbation pushed the system over the boundary from (A) to (B), and the system stayed in (B) even after the perturbation was gone.

Specifically: each HAProxy reload kills in-flight connections. Clients reconnect. Reconnects trigger Consul to re-check. Health-check traffic adds load. At a high reload rate, health-check latency increases, which causes more "unhealthy" flaps. That is a positive feedback loop.
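
A toy model makes the two-state behavior concrete. Everything here is an assumption chosen for illustration: the constants (baseline, threshold, gain) are made up, the model is a single scalar reload rate, and only the cap of 30 echoes the ~30/sec figure in the timeline. It is a sketch of the feedback shape, not a reconstruction of the incident.

```python
from typing import List

BASELINE = 1.0    # reloads/sec from ordinary churn (assumed)
THRESHOLD = 5.0   # rate above which reloads start inducing extra flaps (assumed)
GAIN = 2.0        # induced reloads per unit of rate above the threshold (assumed)
CAP = 30.0        # saturation rate, echoing the ~30/sec seen in the timeline


def simulate(perturbation: float, steps: int = 40) -> List[float]:
    """Return the reload rate per step; the external perturbation lasts steps 5..9 only."""
    rate = BASELINE
    history = []
    for t in range(steps):
        external = perturbation if 5 <= t < 10 else 0.0
        # Feedback term: reloads beyond the threshold cause reconnects and slow
        # health checks, which cause more flaps and therefore more reloads.
        induced = GAIN * max(0.0, rate - THRESHOLD)
        rate = min(CAP, BASELINE + external + induced)
        history.append(rate)
    return history


small = simulate(perturbation=3.0)    # stays under the threshold
large = simulate(perturbation=20.0)   # crosses the threshold
```

In this model the small perturbation decays back to the baseline once it ends, while the large one leaves the rate pinned at the cap long after the external trigger is gone. That is state (B): the feedback term alone is enough to sustain the load.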

Blast radius

~4 hours of degraded Slack globally. Messages delayed; some dropped; channel switches returned errors; file uploads failed. ~10M business users affected. Revenue + customer-trust impact; Slack issued credits under SLA terms.

Lessons

Rate-limit membership-change reactions. Don't let a stream of "backend flapped" events trigger a stream of reloads. Batch + dampen.
Reconnection storms are a separate threat model. Budget for them; ideally cap reconnect rate client-side.
Metastable failures are a real, specific failure class. They are often mitigated by load shedding: if the system is overwhelmed, drop some traffic aggressively instead of struggling to serve all of it.
Health checks should be cheap. If health checking at rate X causes load Y that itself affects health, the check is too expensive.
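
One common way to cap reconnect rate client-side is exponential backoff with full jitter. The sketch below is a generic version of that technique, not Slack's client logic; the function name reconnect_delay and the base and cap values are assumptions.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 0.5,    # first-retry ceiling in seconds (assumed value)
    cap: float = 60.0,    # upper bound on any single delay in seconds (assumed value)
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based)."""
    source = rng if rng is not None else random
    # The ceiling doubles with each failed attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so clients that were
    # disconnected by the same reload do not all return at the same instant.
    return source.uniform(0.0, ceiling)


# Example: delays for the first five attempts of one client.
delays = [reconnect_delay(attempt) for attempt in range(5)]
```

The jitter matters as much as the backoff here. A reload disconnects many clients at once, and without randomization they would all retry on the same schedule and arrive in synchronized waves.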

Concepts in play

Metastable failures — exactly this pattern.
Health checks + failure detection — aggressive checks cause their own failures.
Load balancer — HAProxy reload behavior.
Service discovery — Consul's role as source of truth.
Backpressure — load shedding is the escape hatch.
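
Load shedding as an escape hatch can be as simple as a hard limit on in-flight work, with excess requests rejected immediately. This is a minimal single-threaded sketch; the LoadShedder class and the limit it takes are hypothetical, and a real proxy or service would need locking and a limit tuned from measurements.

```python
class LoadShedder:
    """Hypothetical admission control: reject work beyond a fixed in-flight limit."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed = 0  # count of rejected requests, useful as an alerting signal

    def try_admit(self) -> bool:
        # Reject immediately instead of queueing: a queue only defers the
        # overload and adds latency that can trip health checks.
        if self.in_flight >= self.max_in_flight:
            self.shed += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call when an admitted request finishes.
        if self.in_flight > 0:
            self.in_flight -= 1


shedder = LoadShedder(max_in_flight=3)
results = [shedder.try_admit() for _ in range(5)]  # the last two are shed
```

Rejecting early keeps the work per request bounded, which is what lets an overloaded system drain back toward its healthy state.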