Post-mortem · Coordination + consensus

73 Hours Down

Enabling a new streaming feature in Consul triggered a pathological interaction with BoltDB, Consul's embedded storage engine. Write latency exploded and Consul became unresponsive. Recovery took 73 hours, the worst outage in Roblox's history. HashiCorp engineers were brought in to help.

Consul · BoltDB pathology · 73 hour outage · Hotspot

TL;DR

Roblox enabled Consul's new "streaming" feature to reduce CPU pressure. Under high write load, the feature interacted badly with BoltDB's free-list allocator, and write latency went from <10 ms to >1 s. Consul clients were flooded with stale reads and timeouts. Every service that depended on Consul for config or discovery (which was most of them) broke. Recovery required disabling streaming, repairing Raft state, and finally patching the BoltDB interaction.

Timeline

Oct 28, 13:37 PT — Roblox users begin reporting they can't connect. Backend failures widespread.
Oct 28, evening — Engineers identify Consul cluster as the failing component. Roll back the streaming feature. Write latency still high; Consul remains unstable.
Oct 29 — HashiCorp engineers engaged. Investigation into BoltDB free-list behavior — performance degraded non-linearly with free-list size.
Oct 30 — Iterative recovery: snapshot restore, Raft leadership recovery, patch applied to free-list handling. Many false starts.
Oct 31, evening — Consul stable. Services restarted + warmed up. Roblox fully operational after 73 hours.

Root cause

Roblox turned on Consul streaming, a relatively new feature, in production. Under sustained high write load, streaming triggered a BoltDB free-list pathology: the free-list grew enormous as pages were repeatedly freed and reallocated, until scanning the free-list on each transaction took longer than the transaction's own work. Consul writes slowed from milliseconds to seconds.
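To make the non-linear slowdown concrete, here is a toy model of a free-list allocator that looks for a run of contiguous free pages by scanning a sorted list. It is a sketch under stated assumptions, not BoltDB's actual code: the function names and the "every other page is free" fragmentation pattern are invented for illustration.

```python
def allocate(free_pages, n):
    """Find the first run of n contiguous page ids in a sorted free list.

    Returns (start_page_or_None, entries_scanned). Toy model only;
    a real storage engine tracks far more state than this.
    """
    scanned = 0
    run_start = 0
    run_len = 0
    prev = None
    for i, page in enumerate(free_pages):
        scanned += 1
        if prev is not None and page == prev + 1:
            run_len += 1          # extends the current contiguous run
        else:
            run_start = i         # a gap: start a new run here
            run_len = 1
        if run_len == n:
            start = free_pages[run_start]
            del free_pages[run_start:run_start + n]  # pages are now in use
            return start, scanned
        prev = page
    return None, scanned          # no run found: the whole list was walked


def fragmented_free_list(size):
    # Hypothetical worst case: every other page is free, so no two
    # free pages are adjacent and a 2-page request can never be met.
    return list(range(0, 2 * size, 2))


for size in (1_000, 10_000, 100_000):
    _, scanned = allocate(fragmented_free_list(size), 2)
    print(size, scanned)
```

In this worst case every allocation attempt walks the entire list, so the per-transaction cost grows with the size of the free-list rather than with the size of the write.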

Once write latency climbed, Consul leaders could no longer keep up their Raft heartbeats, so followers timed out and called new elections. Each election generated more writes, which pushed latency higher still: a feedback loop. Even turning streaming off could not bring immediate recovery, because the bloated free-list still needed compaction.
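The feedback loop can be illustrated with a deliberately crude model. Everything here is an assumption made for illustration: the function name, the single election timeout, and the idea that each election multiplies write latency by a fixed factor. Real Raft behavior is far more nuanced.

```python
def simulate(base_latency_ms, election_timeout_ms, rounds, election_cost=0.5):
    """Return write latency per round in a toy election feedback model.

    Assumption: whenever latency exceeds the election timeout, an election
    happens and its extra writes raise latency by `election_cost` (50%).
    """
    latency = base_latency_ms
    history = []
    for _ in range(rounds):
        history.append(latency)
        if latency > election_timeout_ms:
            # Leader missed its deadline: followers start an election,
            # and the election's own writes add to the load.
            latency *= 1 + election_cost
    return history


print(simulate(100, 1000, 3))    # below the timeout: latency stays flat
print(simulate(1200, 1000, 3))   # above the timeout: latency keeps growing
```

The point of the sketch is the threshold: below the timeout nothing happens, and above it the system cannot recover unaided, because the recovery mechanism itself adds load.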

Deeper cause: Roblox used Consul not just for service discovery but as the shared config, KV, and locking primitive for almost every service. That made it a single point of failure: no service could operate degraded-but-functional when Consul was unavailable.

Blast radius

73-hour full Roblox outage. ~50M daily active users. The three-day Halloween weekend, normally peak traffic, was entirely missed. Revenue impact $25M+. Reputational damage was sharpest among kids and their parents, who had no explanation for why the game was down. Roblox published an unusually detailed 40-page post-mortem (now an industry reference) months later.

Lessons

New features should go through isolated staging first. Enabling streaming directly in production, on the cluster that nearly every service depends on, is unsafe. Stage it in a canary region, tenant, or time window.
Don't build on a single point of dependency. Roblox's services should have been designed to operate degraded-but-functional when Consul was down. Instead, "Consul down = nothing works."
Storage engine internals matter at scale. The BoltDB free-list pathology was invisible until sustained high write throughput. Know what your storage does when stressed.
Recovery drills for "all-services-depend-on-X is broken" scenarios. When Consul was unavailable, Roblox's service restart ordering was itself Consul-dependent. Rebooting in the right order required printing runbooks on paper.
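One way to rehearse the restart-ordering lesson is to keep the dependency map somewhere that does not itself depend on the failing system, and compute the start order offline. The sketch below uses Python's standard `graphlib` module; the service names and the dependency map are hypothetical and not taken from Roblox's infrastructure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: service -> services that must be running first.
DEPS = {
    "consul": set(),
    "config-loader": {"consul"},
    "auth": {"consul", "config-loader"},
    "game-servers": {"auth"},
}


def restart_order(dependencies):
    """Return a valid start order; raises CycleError on circular dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = restart_order(DEPS)
print(order)

# A circular dependency means there is no safe cold-start order at all.
try:
    restart_order({"consul": {"telemetry"}, "telemetry": {"consul"}})
except CycleError as exc:
    print("cannot cold-start:", exc.args[0])
```

Running a check like this in CI would surface a "X needs Y, which needs X" loop long before an incident forces a cold start.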

Concepts in play

Service discovery — Consul's dominant role.
Consensus (Raft) — why elections under load multiply problems.
Storage engine internals — BoltDB as an example of free-list pitfalls.
Blast radius — avoid ecosystem-wide dependencies.
Graceful degradation — services should survive degraded Consul.
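As a sketch of what graceful degradation could look like on the client side, the class below serves the last known good value when the config backend is unreachable. It is a minimal illustration under stated assumptions: `CachedConfig`, `BackendUnavailable`, and `fake_fetch` are hypothetical names, and the fetch function stands in for whatever client a service actually uses to read from Consul.

```python
import time


class BackendUnavailable(Exception):
    """Raised by the (hypothetical) fetch function when the backend is down."""


class CachedConfig:
    def __init__(self, fetch, max_stale_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time it was fetched)

    def get(self, key):
        """Return (value, "fresh") or, if the backend is down, (value, "stale")."""
        try:
            value = self._fetch(key)
        except BackendUnavailable:
            if key in self._cache:
                cached, fetched_at = self._cache[key]
                if self._clock() - fetched_at <= self._max_stale:
                    return cached, "stale"
            raise  # nothing usable cached: surface the failure
        self._cache[key] = (value, self._clock())
        return value, "fresh"


# Demo with a fake backend that can be switched off.
state = {"up": True}


def fake_fetch(key):
    if not state["up"]:
        raise BackendUnavailable(key)
    return {"max_players": "50"}[key]


cfg = CachedConfig(fake_fetch)
print(cfg.get("max_players"))   # backend healthy
state["up"] = False
print(cfg.get("max_players"))   # backend down, served from cache
```

Returning the freshness label alongside the value lets callers decide which operations are safe on stale data and which should be refused until the backend returns.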