Concept · Operations

Multi-Region — Active-Active vs Active-Passive

01

Why this matters

One AWS region fails — your service is down for 30 minutes. Multi-region deployment fixes this, but how? You can run the second region passively (cold or warm standby that takes traffic only on disaster) or actively (both regions serve users continuously). The choice cascades through every architectural decision: replication strategy, consistency model, DNS, costs.

Active-active sounds obviously better, until you confront writes from two places, conflict resolution, and 2× the infra bill. Most teams who say they want active-active actually want active-passive done well.

02

The four configurations

Pattern                       | Both regions serving?              | RTO      | Added cost vs single-region
Active-cold                   | No · second region built on demand | 4-24 hr  | ~5% (backup storage only)
Active-pilot                  | No · minimal infra warm            | 1-4 hr   | ~25%
Active-passive (warm standby) | No · full standby ready            | 5-30 min | ~50-80%
Active-active                 | Yes · both serve real traffic      | seconds  | ~100% (2× infra)

03

Active-passive — the easy mode

Region A serves all traffic. Region B has identical infrastructure but no traffic — async replication keeps its data current. On disaster:

  1. Health check fails (Route 53 or external monitor).
  2. DNS updated to point to region B (or anycast routing kicks in).
  3. Region B's database is promoted from replica to primary.
  4. Service comes online. RTO: 5-30 min depending on automation.
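
The four steps above can be sketched as a small orchestration loop. This is a minimal in-memory sketch, not a production runbook: `Region`, `FailoverController`, and the `failures_to_trip` threshold are hypothetical names, and the DNS change and database promotion are simulated with plain attributes rather than calls to Route 53 or a database API.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    healthy: bool = True
    db_role: str = "replica"  # "primary" accepts writes, "replica" only receives them


@dataclass
class FailoverController:
    active: Region
    standby: Region
    failures_to_trip: int = 3      # assumed threshold; tune per service
    consecutive_failures: int = 0
    dns_target: str = ""
    events: list = field(default_factory=list)

    def __post_init__(self) -> None:
        self.active.db_role = "primary"
        self.dns_target = self.active.name

    def tick(self) -> bool:
        """Run one health-check cycle; return True if a failover happened."""
        # Step 1: the health check must fail several times in a row.
        if self.active.healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_to_trip:
            return False

        # Step 2: point DNS at the standby region.
        self.dns_target = self.standby.name
        self.events.append(f"dns -> {self.standby.name}")

        # Step 3: promote the standby database and demote the old primary,
        # so the old primary cannot accept writes if it comes back.
        self.active.db_role = "replica"
        self.standby.db_role = "primary"
        self.events.append(f"promoted {self.standby.name}")

        # Step 4: the former standby is now the serving region.
        self.active, self.standby = self.standby, self.active
        self.consecutive_failures = 0
        return True


region_a, region_b = Region("region-a"), Region("region-b")
controller = FailoverController(active=region_a, standby=region_b)
region_a.healthy = False
results = [controller.tick() for _ in range(3)]
print(results, controller.dns_target, region_b.db_role)
```

Requiring several consecutive failures before tripping is a design choice to avoid failing over on a single flaky check. It trades a slightly longer RTO for fewer false failovers.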

Wins: no write conflicts — only one region accepts writes at a time. Standard relational databases work fine. Conceptually simple; most engineering teams can ship this.

Catches: warm standby may not actually work when needed (untested code paths). Async replication means RPO > 0 — some recent writes can be lost. Failback (returning to A after recovery) is operationally tricky.
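
To make "RPO > 0" concrete: with async replication, the writes at risk are roughly those committed during the replication lag window. A back-of-envelope sketch follows. The function name and the lag and write-rate figures are illustrative assumptions, not measurements.

```python
def writes_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Rough upper bound on writes not yet replicated when the primary fails."""
    return replication_lag_s * writes_per_s


# Assumed numbers: 2 s of replication lag at 500 writes/s.
at_risk = writes_at_risk(replication_lag_s=2.0, writes_per_s=500.0)
print(at_risk)  # 1000.0
```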

04

Active-active — the hard mode

Both regions serve. Users in EU hit EU region; users in US hit US region. Lower latency for everyone. Region failure is invisible to most users.

The hard part: writes happen in both regions simultaneously. Conflict resolution becomes mandatory. Three approaches:

  • Sharded by user/region — user "always lives" in one region. Writes for user X always go to that region. Other region holds a replica for reads. Reduces conflicts to near-zero. Practical default.
  • CRDTs / multi-master with merge — accept that conflicts happen, design data types that merge automatically. Vector clocks + LWW or richer CRDTs. Works for some data shapes (counters, sets); not for ordered transactions.
  • Sync cross-region writes — every write coordinates across regions before commit. Paxos/Raft across regions. Spanner does this. Latency cost: 100-200ms per write.
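
The first approach, sharding by user, comes down to a routing rule: writes go to the user's home region, reads can be served locally. A minimal sketch, assuming a hypothetical `HOME_REGION` directory and `route` function; a real system would back the directory with a lookup service.

```python
HOME_REGION = {"alice": "eu", "bob": "us"}  # hypothetical user directory


def route(user_id: str, op: str, caller_region: str) -> str:
    """Return the region that should handle this request."""
    if op == "write":
        # Single writer per user: writes always go to the home region,
        # so two regions never accept conflicting writes for the same user.
        return HOME_REGION[user_id]
    if op == "read":
        # Reads are served by the caller's local copy, which is either the
        # primary or an async replica (and so possibly slightly stale).
        return caller_region
    raise ValueError(f"unknown op: {op}")


print(route("alice", "write", caller_region="us"))  # eu
print(route("alice", "read", caller_region="us"))   # us
```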

Picking depends on workload. Banking can't tolerate conflict resolution → sync sharded by account. Social networks can → user-level sharding with CRDT counters for likes. Cassandra-backed services use LWW + accept some lost writes.
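
The difference between LWW and a CRDT counter is easiest to see side by side. The sketch below uses a grow-only counter (G-counter) with one slot per region and compares it with a last-write-wins register on the same concurrent updates. The class and function names are illustrative, and a real like counter that supports un-likes would need a PN-counter or similar.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Last-write-wins register: keep the (timestamp, value) pair with the larger timestamp."""
    return max(a, b, key=lambda pair: pair[0])


class GCounter:
    """Grow-only counter CRDT: one slot per region, merge takes the per-slot max."""

    def __init__(self) -> None:
        self.counts: dict = {}

    def increment(self, region: str, n: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> None:
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


# Concurrent likes: 3 in the EU region, 2 in the US region.
eu, us = GCounter(), GCounter()
eu.increment("eu", 3)
us.increment("us", 2)
eu.merge(us)
us.merge(eu)
print(eu.value(), us.value())  # 5 5

# The same updates as LWW registers: the later timestamp wins, the other write is lost.
print(lww_merge((100, 3), (101, 2)))  # (101, 2)
```

Merging is commutative and idempotent, so the regions can exchange state in any order, any number of times, and still converge on the same total.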

User sharding for active-active (Mermaid)

flowchart LR
  EU_user[User in EU] -->|writes| EU_region[(EU region<br/>primary for EU users)]
  US_user[User in US] -->|writes| US_region[(US region<br/>primary for US users)]
  EU_region -. async replication .-> US_region
  US_region -. async replication .-> EU_region
  EU_user -. "reads (own data, local)" .-> EU_region
  US_user -. cross-region reads only when needed .-> EU_region

05

When each pattern wins

  • Active-passive for: most B2B SaaS, internal tools, anything where 15 min downtime is OK if rare. Lower complexity, lower cost.
  • Active-active sharded for: global consumer apps with locality (Slack workspaces, Dropbox accounts, Notion teams). Mostly each user lives in one region.
  • Active-active CRDT for: counters, dashboards, soft state. Eventual consistency tolerated.
  • Active-active synchronous for: financial systems where consistency > availability. Pay the latency.

Common mistake

Picking active-active for marketing reasons ("we never go down") without designing for write conflicts. Result: silent data corruption that takes months to discover. Default to active-passive unless you have a real conflict-handling story.

06

Real-world

Stripe

Active-active sharded by merchant

Each merchant lives in one region. Cross-region failover for that merchant only happens on disaster. Writes never conflict in normal operation.

Slack

Active-active sharded by workspace

Workspaces pinned to regions. Compliance-sensitive workspaces (EU) must stay in EU. Failover is workspace-level.

DynamoDB Global Tables

Multi-region active-active CRDT

LWW conflict resolution built in. Writes typically replicate across regions in under a second. Suitable for soft state; not for invariants.

Spanner

Active-active synchronous

Paxos across regions for every write. Strict consistency, ~100-200ms write latency. Payment-network grade.
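
The write latency of a synchronous design follows from quorum arithmetic: a leader can commit once a majority of replicas (counting itself) have acknowledged, so the commit waits on the slowest follower it still needs. The sketch below is a simplified model of a majority-quorum commit, not Spanner's actual implementation. The function name and the round-trip times are illustrative assumptions.

```python
def quorum_commit_latency_ms(follower_rtts_ms: list) -> float:
    """Estimate commit latency as the RTT to the last follower needed for a majority."""
    replicas = len(follower_rtts_ms) + 1   # followers plus the leader
    majority = replicas // 2 + 1
    acks_needed = majority - 1             # the leader counts toward the majority
    if acks_needed == 0:
        return 0.0
    return float(sorted(follower_rtts_ms)[acks_needed - 1])


# Assumed leader-to-follower round trips for a 5-replica, multi-region group.
latency = quorum_commit_latency_ms([70, 90, 150, 200])
print(latency)  # 90.0
```

Placing enough replicas close to the leader to form a majority keeps commit latency near the lower RTTs. The far replicas still receive every write, but commits do not wait for them.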

07

Used in problems

WhatsApp uses active-active sharded by user_id. News feed uses active-active for reads, primary-region for writes. Payment gateway runs active-active synchronous for the ledger. E-commerce is typically active-passive with a warm standby; failover is acceptable if rare.
