Post-mortem · Coordination + consensus

43-Second Partition

A 43-second network partition between GitHub's US East and US West data centers caused MySQL Orchestrator to promote a new primary in the wrong region. The split-brain writes that followed took 24 hours to reconcile. Webhooks, Pages, and the GraphQL API were all impacted.

Split brain · MySQL Orchestrator · 43s → 24h · Consensus

TL;DR

Routine maintenance on a 100G network link caused a 43-second partition between GitHub's US East (primary) and US West (replica) regions. MySQL Orchestrator, using Raft for consensus on which node is primary, decided during the partition that US West should be primary — while US East believed itself still primary. Writes landed in both regions. When the partition healed, conflict resolution took 24 hours of manual + scripted replay; webhooks, Pages, search, and parts of the API remained impacted the entire time.

Timeline

22:52 UTC Oct 21 — Routine network maintenance causes 43-second connectivity loss between US East and US West.
22:52:26 UTC — Orchestrator (using Raft) in US West loses contact with East but still holds a Raft quorum. Initiates failover per config: promotes a US West replica to primary.
22:53:09 UTC — Network heals. Two primaries now exist for the same MySQL cluster. Both receive writes.
22:54 UTC — Orchestrator detects split brain; stops all automated operations. Engineers paged.
23:13 UTC — GitHub public status updated: degraded service.
Oct 22, 01:00–23:00 UTC — Manual reconciliation. East-originating writes during the 43-second window were not replicated cleanly; had to be compared + replayed or discarded per-record. Services kept in degraded mode to prevent more write divergence.
Oct 23, ~00:00 UTC — Full service restored.
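
The per-record reconciliation described in the timeline can be pictured with a small sketch. This is not GitHub's tooling: the `Write` record, the `Action` names, and the rule that identical values are discarded are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REPLAY = "replay"    # only the old primary has it: apply it to the new primary
    DISCARD = "discard"  # both sides already agree: nothing to do
    REVIEW = "review"    # both sides changed it differently: a human decides


@dataclass(frozen=True)
class Write:
    key: str    # hypothetical record identifier
    value: str  # hypothetical payload


def classify(east: list[Write], west: list[Write]) -> dict[str, Action]:
    """Build a per-record plan for writes stranded on the old (East) primary."""
    west_by_key = {w.key: w.value for w in west}
    plan: dict[str, Action] = {}
    for w in east:  # a later write to the same key overrides the earlier plan entry
        if w.key not in west_by_key:
            plan[w.key] = Action.REPLAY
        elif west_by_key[w.key] == w.value:
            plan[w.key] = Action.DISCARD
        else:
            plan[w.key] = Action.REVIEW
    return plan


if __name__ == "__main__":
    east = [Write("issue:1", "a"), Write("issue:2", "b"), Write("issue:3", "c")]
    west = [Write("issue:2", "b"), Write("issue:3", "x")]
    for key, action in classify(east, west).items():
        print(key, action.value)
```

The `REVIEW` bucket is the expensive one: every record in it needs a person, which is one way to see why 43 seconds of divergence can turn into a day of work.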

Root cause

Orchestrator was configured to prefer promoting replicas in US West even when the primary was in US East, because most of the Raft quorum sat in US West. The automated failover fired before humans could evaluate whether the partition was a brief blip or a real outage. The promotion was safe from a leader-election perspective but not from a data-consistency perspective: writes had been landing in US East right up until the 43-second cut, and the most recent of them had not yet been replicated west at cut time.

Deeper cause: the failover setup assumed that a partition always meant a real outage, so it triggered auto-failover aggressively. For a 43-second blip, auto-failover cost more than doing nothing would have.
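
To see why unreplicated writes exist at cut time, here is a minimal model of asynchronous replication. It assumes a fixed replication lag and integer millisecond timestamps; the function name and the numbers are hypothetical and not taken from the incident.

```python
def stranded_writes(commit_times_ms: list[int], cut_time_ms: int, lag_ms: int) -> list[int]:
    """Return commits the primary acknowledged but the replica never received.

    A write committed at time t reaches the replica at t + lag_ms.
    If the link drops at cut_time_ms before that, the write is stranded.
    """
    return [
        t for t in commit_times_ms
        if t <= cut_time_ms and t + lag_ms > cut_time_ms
    ]


if __name__ == "__main__":
    # Hypothetical commits (ms), a link cut at t=1000 ms, and a 50 ms lag.
    commits = [0, 900, 960, 990, 1000]
    print(stranded_writes(commits, cut_time_ms=1000, lag_ms=50))
```

In this model, every write committed within one lag interval before the cut exists only on the old primary. Promoting a replica at that moment creates a second history that lacks those writes.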

Blast radius

~24 hours of degraded service affecting all GitHub users. Webhook deliveries delayed by hours. Pages + GraphQL partially unavailable. Pull-request creation, new issue creation, and some repo operations returned errors or inconsistent state. No data was ultimately lost, but reconciliation was painful. Engineers did not re-enable writes across the cluster until every divergent record had been inspected.

Lessons

Auto-failover during brief partitions is worse than degraded service. GitHub raised the partition duration that must elapse before Orchestrator can promote.
Cross-region primary should require human approval. Promoting across a slow WAN link has expensive consequences on write latency + consistency guarantees. Post-incident, Orchestrator was configured never to promote cross-region without manual approval.
Split-brain recovery is not a fast automated process. Build the runbooks + tooling for manual reconciliation BEFORE you need them.
Reduce the auto-failover blast radius. Automated failover should be limited to within-region replacements where replication lag is microseconds, not cross-region where replication lag is milliseconds.
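
The lessons above combine into one decision rule: wait out short blips, promote automatically only within a region, and ask a human before crossing regions. The sketch below is illustrative and is not Orchestrator's configuration or API. `FailoverPolicy`, `decide`, the region names, and the 120-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    min_outage_seconds: float              # hypothetical: how long to wait before acting
    allow_cross_region_auto: bool = False  # hypothetical: cross-region needs a human by default


def decide(
    policy: FailoverPolicy,
    primary_region: str,
    candidate_region: str,
    unreachable_seconds: float,
    human_approved: bool = False,
) -> str:
    """Return "wait", "promote", or "page-human" for a proposed promotion."""
    if unreachable_seconds < policy.min_outage_seconds:
        return "wait"  # too short to distinguish a blip from an outage
    if candidate_region == primary_region:
        return "promote"  # within-region replacement is allowed automatically
    if policy.allow_cross_region_auto or human_approved:
        return "promote"
    return "page-human"  # cross-region promotion requires approval


if __name__ == "__main__":
    policy = FailoverPolicy(min_outage_seconds=120)  # assumed value, not from the source
    print(decide(policy, "us-east", "us-west", unreachable_seconds=43))
    print(decide(policy, "us-east", "us-west", unreachable_seconds=300))
```

Under this assumed policy, a 43-second partition results in no action at all, which is the outcome the post-mortem argues would have been cheaper.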

Concepts in play

Consensus (Raft) — Orchestrator uses Raft to agree on which node is primary, but that agreement does not protect writes the old primary accepted and had not yet replicated.
Split brain — textbook case.
Replication — async replication lag is the root of reconciliation pain.
Multi-region — cross-region primaries trade availability for latency.
Failure detection — too-aggressive timeouts cause flapping.
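
The quorum behaviour behind these concepts can be sketched with the standard majority rule: a side of a partition can act only if it reaches more than half of all voters. The voter counts per region below are hypothetical; the source does not say how many Orchestrator nodes ran where.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Majority rule: strictly more than half of all voters must be reachable."""
    return reachable_voters > total_voters // 2


def sides_with_quorum(voters_by_side: dict[str, int]) -> list[str]:
    """Given how many voters sit on each side of a partition, list the sides that can act."""
    total = sum(voters_by_side.values())
    return [side for side, n in voters_by_side.items() if has_quorum(n, total)]


if __name__ == "__main__":
    # Hypothetical placement: one voter in East, two in West.
    print(sides_with_quorum({"us-east": 1, "us-west": 2}))
    # An even split leaves neither side able to elect a leader.
    print(sides_with_quorum({"us-east": 2, "us-west": 2}))
```

At most one side can ever hold a majority, so the consensus layer itself never splits. The split brain in this incident sat one level down, in the MySQL primaries that the consensus layer was steering.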