Concept · Operations

Disaster Recovery — RTO & RPO

01

Why this matters

Your entire AWS region goes down. Power failure, fiber cut, malicious actor, hurricane. How long until your service is back? How much data did you lose? These two questions have formal names: RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Every serious system has explicit numbers for both, agreed with the business, and architecture designed to meet them.

"We have backups" is not a DR plan. "Our RTO is 1 hour, RPO is 5 minutes, tested quarterly via region failover" is.

02

RTO vs RPO

  • RTO — Recovery Time Objective. How long until the service is back up after disaster strikes. "Within 1 hour."
  • RPO — Recovery Point Objective. How much data loss you can tolerate. "At most 5 minutes of writes lost."

They're orthogonal. You can have low RTO + high RPO ("we're back fast but lost a day of orders") or high RTO + low RPO ("we lost zero data but were down 12 hours"). Most businesses want both low — and pay for it.
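
To make the independence of the two numbers concrete, here is a minimal sketch that measures both from an incident timeline. The `Incident` record, its field names, and the timestamps are all hypothetical, chosen only for illustration; a real post-incident review would pull these values from monitoring and replication logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of one outage (field names are illustrative)."""
    outage_start: datetime          # when the primary stopped serving
    service_restored: datetime      # when users could use the service again
    last_surviving_write: datetime  # newest write present in the recovered copy


def actual_recovery_time(incident: Incident) -> timedelta:
    """Downtime experienced; this is what gets compared against the RTO."""
    return incident.service_restored - incident.outage_start


def actual_data_loss(incident: Incident) -> timedelta:
    """Window of lost writes; this is what gets compared against the RPO."""
    lost = incident.outage_start - incident.last_surviving_write
    return max(lost, timedelta(0))


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both objectives were met."""
    return actual_recovery_time(incident) <= rto and actual_data_loss(incident) <= rpo


# Back in 10 minutes, but restored from a backup taken at midnight.
fast_but_lossy = Incident(
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 10),
    last_surviving_write=datetime(2024, 1, 1, 0, 0),
)

# No writes lost, but down for 12 hours.
slow_but_complete = Incident(
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 2, 0, 0),
    last_surviving_write=datetime(2024, 1, 1, 12, 0),
)
```

Against a target of RTO 1 hour and RPO 5 minutes, both incidents fail, each on a different axis.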

Typical RTO by architecture:

  • RTO < 1 min: multi-region active-active
  • RTO ~15 min: warm standby with auto-failover
  • RTO 1-4 hr: cold standby with restore from backup

Typical RPO by replication method:

  • RPO = 0: synchronous cross-region replication
  • RPO < 1 min: async replication, healthy network
  • RPO 1-15 min: scheduled snapshot + WAL shipping
  • RPO 1-24 hr: nightly backup only

03

The four DR tiers

| Tier | Cost overhead | RTO | RPO | How |
|---|---|---|---|---|
| Backup & restore | ~5% | 4-24 hr | 4-24 hr | Daily backups to S3 cross-region; restore on disaster |
| Pilot light | ~25% | 1-4 hr | 5-60 min | Minimal infra running in DR region; scale up on disaster + replay WAL |
| Warm standby | ~50% | 5-30 min | 1-5 min | Full but downscaled stack in DR region; auto-promote |
| Multi-site active-active | ~100% (2× infra) | seconds | 0 (or near-0) | Both regions serving traffic; sync replication; DNS or anycast failover |

04

Choosing the right tier

Cost scales steeply. Match tier to business impact:

  • Backup & restore — internal tools, dev environments, low-stakes apps. Most startups have only this and don't realize.
  • Pilot light — moderate-importance services where 1-hour downtime is annoying but not lethal.
  • Warm standby — most B2B SaaS, e-commerce. 15-minute RTO is what enterprise customers expect.
  • Multi-site active-active — payments, healthcare, financial trading, any system whose downtime makes the news. Pay 2×.
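
One way to read the matching rule is "pick the cheapest tier whose worst case still meets the targets". The sketch below encodes the tier table's figures as data, using the upper bound of each range. Two values are assumptions rather than figures from the table: "seconds" is modeled as 60 seconds and "0 (or near-0)" as exactly zero. The `Tier` class and `cheapest_tier` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    cost_overhead: float  # extra cost as a fraction of primary infra
    worst_rto: timedelta  # upper bound of the tier's RTO range
    worst_rpo: timedelta  # upper bound of the tier's RPO range


TIERS = [
    Tier("Backup & restore", 0.05, timedelta(hours=24), timedelta(hours=24)),
    Tier("Pilot light", 0.25, timedelta(hours=4), timedelta(minutes=60)),
    Tier("Warm standby", 0.50, timedelta(minutes=30), timedelta(minutes=5)),
    # Assumption: "seconds" modeled as 60 s, "near-0" modeled as 0.
    Tier("Multi-site active-active", 1.00, timedelta(seconds=60), timedelta(0)),
]


def cheapest_tier(rto_target: timedelta, rpo_target: timedelta) -> Optional[Tier]:
    """Return the lowest-cost tier whose worst case meets both targets, or None."""
    candidates = [
        t for t in TIERS
        if t.worst_rto <= rto_target and t.worst_rpo <= rpo_target
    ]
    return min(candidates, key=lambda t: t.cost_overhead, default=None)


choice = cheapest_tier(timedelta(hours=1), timedelta(minutes=5))
```

With a 1-hour RTO and 5-minute RPO, pilot light is ruled out by its 4-hour worst-case RTO, so `choice` is the warm standby tier. Using worst-case bounds is a deliberately conservative design choice; a team with measured drill results could substitute its own numbers.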

The forcing function: SLA. If you sell 99.99% uptime (52 min/year), warm standby barely qualifies. If you sell 99.999% (5 min/year), only active-active works.
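
The arithmetic behind those budgets is short enough to sketch. The function names are hypothetical, and the year is taken as 365 days, which ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def failovers_within_budget(availability_pct: float, rto_minutes: float) -> int:
    """How many full-length failovers fit in one year's downtime budget."""
    return int(downtime_budget_minutes(availability_pct) // rto_minutes)


four_nines = downtime_budget_minutes(99.99)    # about 52.56 minutes
five_nines = downtime_budget_minutes(99.999)   # about 5.26 minutes

# With a 15-minute RTO (warm standby):
warm_at_four_nines = failovers_within_budget(99.99, 15)
warm_at_five_nines = failovers_within_budget(99.999, 15)
```

At 99.99%, a 15-minute failover fits three times a year with little to spare, and that assumes no other downtime at all. At 99.999%, a single 15-minute failover already exceeds the whole annual budget, which is why only active-active can meet it.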

05

Deep dive — what real DR plans look like

The plan is a document with these sections:

  1. Service inventory — every component, its tier, RTO/RPO, dependencies. Everything that's not on this list is implicitly RTO=∞.
  2. Trigger conditions — what events cause DR activation. "Region health-check failure for > 5 min" → declare incident.
  3. Decision authority — who calls the failover. Specific named people + 24/7 contact.
  4. Failover runbook — step-by-step. Tested. The 3am person should not need to think.
  5. Communication plan — status page, customer email templates, leadership escalation.
  6. Data validation — how do we verify the DR region's data is intact post-failover.
  7. Failback procedure — once primary recovers, how to come home. Often more dangerous than the original failover.
  8. Test cadence — quarterly minimum. Annual full-region failover.
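
The service inventory in item 1 lends itself to an automated consistency check. The sketch below assumes a hypothetical in-memory inventory; the `Service` class, the service names, and the rule that a dependency's RTO should not be looser than its dependent's are illustrative choices, not a standard.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, List


@dataclass
class Service:
    name: str
    rto: timedelta
    rpo: timedelta
    depends_on: List[str] = field(default_factory=list)


def inventory_gaps(inventory: Dict[str, Service]) -> List[str]:
    """List dependencies that are missing or slower to recover than their dependents."""
    problems = []
    for svc in inventory.values():
        for dep_name in svc.depends_on:
            dep = inventory.get(dep_name)
            if dep is None:
                # Not on the list means nobody has committed to recovering it.
                problems.append(f"{svc.name}: dependency {dep_name} is not in the inventory")
            elif dep.rto > svc.rto:
                # A service cannot come back before the things it needs.
                problems.append(
                    f"{svc.name}: RTO {svc.rto} is tighter than {dep.name}'s RTO {dep.rto}"
                )
    return problems


# Hypothetical inventory for illustration.
inventory = {
    "checkout": Service("checkout", timedelta(minutes=15), timedelta(minutes=1),
                        depends_on=["orders-db", "email"]),
    "orders-db": Service("orders-db", timedelta(hours=1), timedelta(minutes=1)),
}

gaps = inventory_gaps(inventory)
```

Here `gaps` reports two problems: `checkout` promises 15 minutes but depends on a database that promises only 1 hour, and `email` does not appear in the inventory at all.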

Critical: untested DR doesn't exist. The first time you fail over should not be in a real disaster. Game-day drills find the broken assumptions before they bite — pair this with chaos engineering.
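
A test cadence is only useful if something notices when it slips. As a minimal sketch, assuming drill dates are recorded per service in a simple mapping (the structure, names, and 92-day threshold are assumptions), a scheduled check could flag anything overdue:

```python
from datetime import date, timedelta
from typing import Dict, List, Optional

MAX_DRILL_AGE = timedelta(days=92)  # assumption: "quarterly" read as roughly 92 days


def overdue_drills(last_drill: Dict[str, Optional[date]], today: date) -> List[str]:
    """Return services never drilled, or last drilled more than MAX_DRILL_AGE ago."""
    overdue = [
        name for name, drilled_on in last_drill.items()
        if drilled_on is None or today - drilled_on > MAX_DRILL_AGE
    ]
    return sorted(overdue)


# Hypothetical drill log for illustration.
drill_log = {
    "checkout": date(2024, 5, 1),
    "orders-db": date(2024, 1, 10),
    "search": None,  # never failed over
}

stale = overdue_drills(drill_log, today=date(2024, 6, 1))
```

With these dates, `stale` contains `orders-db` and `search`, while `checkout` is still within its quarter.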

Interview answer

"Our DR tier is warm-standby cross-region: full infra running but downscaled in us-west-2, async replication from us-east-1. RTO 15 min via Route 53 health-check failover; RPO 1 min. We game-day quarterly. Multi-site active-active for payments only — pays its 2× cost via a stricter SLA."

06

Real-world

Amazon Aurora Global Database

RPO < 1s, RTO < 1 min

Cross-region replica with sub-second lag. Promote replica to primary in < 1 min. Backbone of many SaaS DR plans.

Stripe

Multi-region active-active for payments

Payments served from multiple regions simultaneously. Region loss is ~zero downtime, ~zero data loss. The most expensive tier; required by SLA.

Netflix

Multi-region streaming

Three AWS regions, all active. A region failure is designed to be invisible to viewers. Cost: ~3× infra, paid in exchange for surviving the loss of an entire region.

Most B2B SaaS

Warm standby

Realistic compromise. 15-min RTO + 1-min RPO at ~50% extra cost. Meets enterprise customer demands.

07

Used in problems

Payment gateway demands the strictest RTO/RPO. Stock exchange and trading platforms run multi-region active-active. E-commerce checkout + inventory uses warm standby with sub-minute RPO. Distributed logging trades higher RPO (hourly) for lower cost.
