Concept · Foundations

Availability — The Nines

01

Why this matters

"Four nines" (99.99%) sounds marginally better than "three nines" (99.9%). The truth: four nines allows 52 minutes of downtime per year; three nines allows 8.8 hours. That's a 10× difference in how much engineering you put into redundancy, failover, and testing. Getting this wrong in a design discussion means either wildly overspending or promising something you can't deliver.

02

The downtime table

| SLO (% uptime) | Name | Downtime / year | Downtime / month | Downtime / week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.3 hours | 1.7 hours |
| 99.9% | Three nines | 8.77 hours | 43.8 min | 10.1 min |
| 99.95% | Three-and-a-half | 4.38 hours | 21.9 min | 5 min |
| 99.99% | Four nines | 52.6 min | 4.4 min | 1 min |
| 99.999% | Five nines | 5.26 min | 26.3 sec | 6.05 sec |
| 99.9999% | Six nines | 31.5 sec | 2.6 sec | 0.6 sec |

Each extra nine cuts the allowed downtime by 10×, and the cost of getting there rises steeply. At five nines, the entire downtime budget is about 5 minutes a year, which leaves no room for a maintenance window or a reboot that takes the service down. You have to design for zero-downtime deploys, automatic failover, and surviving any single-component failure.
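The figures in the table follow from one multiplication: the allowed downtime is (1 − SLO) times the length of the window. The sketch below reproduces them, assuming an average year of 365.25 days and a month of one twelfth of that; the function and constant names are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year, leap days included
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12
SECONDS_PER_WEEK = 7 * 24 * 3600


def allowed_downtime_seconds(slo_percent: float, window_seconds: float) -> float:
    """Downtime budget in seconds for an SLO (in percent) over a window."""
    return (1 - slo_percent / 100) * window_seconds


for slo in (99.0, 99.9, 99.95, 99.99, 99.999, 99.9999):
    per_year = allowed_downtime_seconds(slo, SECONDS_PER_YEAR)
    per_month = allowed_downtime_seconds(slo, SECONDS_PER_MONTH)
    per_week = allowed_downtime_seconds(slo, SECONDS_PER_WEEK)
    print(
        f"{slo}%: {per_year / 60:.1f} min/year, "
        f"{per_month / 60:.1f} min/month, {per_week:.1f} s/week"
    )
```

A different month length (30 days, say) shifts the monthly column slightly, so it is worth stating the window definition whenever you quote these numbers.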

03

How availability actually composes

Serial dependencies multiply. If your API calls a database (99.9%) and a cache (99.9%) on every request, the API's ceiling is 0.999 × 0.999 ≈ 99.8%. You can only be as good as the product of all dependencies in the hot path.

Parallel redundancy adds nines. Take two servers, each at 99% uptime, with independent failures, where the pair is down only when both are down at once → combined uptime = 1 − (0.01)² = 99.99%. Three independent 99% components in parallel → 99.9999%. This is why redundancy works, provided the failures really are independent.
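Both composition rules fit in a few lines. This is a minimal sketch that assumes independent failures throughout; `serial` and `parallel` are hypothetical helper names, and availabilities are given as fractions rather than percentages.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """All components are required: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: the system is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


# API that needs both a 99.9% database and a 99.9% cache.
print(f"serial:   {serial(0.999, 0.999):.6f}")
# Two, then three, redundant 99% servers.
print(f"parallel: {parallel(0.99, 0.99):.6f}")
print(f"parallel: {parallel(0.99, 0.99, 0.99):.6f}")
```

The two helpers compose, so a redundant pair in front of a single database can be written as `serial(parallel(0.99, 0.99), 0.999)`.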

The dependency rule: if a service is a hard dependency in your request path, your SLO ≤ its SLO. Amazon S3 Standard is designed for 99.999999999% (eleven nines) durability, but its availability SLA only commits to 99.9%. If your app calls S3 on every request with no fallback, you cannot promise 99.99% availability no matter what you do.

04

SLO vs SLA vs SLI

SLI — indicator

What you measure

A metric: "% of requests returning 2xx within 200ms." Measured continuously. Raw data.

SLO — objective

What you aim for internally

A target on an SLI: "99.9% of requests < 200ms over any 30-day window." Drives engineering priority. Breaking it is the signal to stop shipping features and fix reliability.

SLA — agreement

What you promise customers in writing

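A contract with monetary consequences if you miss it: "99.95% monthly uptime or 10% credit." Always set it looser than the internal SLO for the same metric (for example, an internal 99.99% uptime SLO behind a 99.95% SLA) so you have headroom before penalties kick in.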
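To make the three terms concrete, the sketch below computes a request-based SLI and compares it against an SLO and a looser SLA. The `Request` record, the function names, and the 99.9% / 99.5% thresholds are all hypothetical values chosen for illustration, and both targets are applied to the same SLI for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def good_fraction(requests, latency_threshold_ms=200.0):
    """SLI: fraction of requests that returned 2xx under the latency threshold."""
    if not requests:
        return 1.0  # assumption: a window with no traffic counts as meeting the target
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300 and r.latency_ms < latency_threshold_ms
    )
    return good / len(requests)


def evaluate(sli, slo, sla):
    """Classify a measured SLI against the internal SLO and the contractual SLA."""
    if sli >= slo:
        return "ok"
    if sli >= sla:
        return "slo_missed"    # internal alarm, no contractual penalty yet
    return "sla_breached"      # customer credits are owed


# 998 fast successes, one slow success, one server error.
window = [Request(200, 50.0)] * 998 + [Request(200, 350.0), Request(500, 20.0)]
sli = good_fraction(window)
print(sli, evaluate(sli, slo=0.999, sla=0.995))
```

The gap between the two thresholds is the headroom: the SLO alarm fires while the SLA is still intact.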

05

Deep dive — error budgets

Google SRE's key insight: 100% availability is the wrong target. It forbids change — you can't ship new code if any deploy risks downtime. Instead, your SLO implies an error budget — the amount of downtime you're allowed.

If your SLO is 99.9% monthly, your error budget is 43.8 minutes/month. If a major outage burns through it in week 1, you freeze feature work for the remaining three weeks and focus entirely on reliability. If you reach the end of week 4 with 20 minutes unused, you had room to ship a riskier change; consistently leaving budget unspent suggests you're moving more cautiously than the SLO requires.
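The budget arithmetic can be sketched as follows, assuming an average month of 365.25 / 12 days and a simple policy of freezing feature work once the budget is exhausted. The names and the policy itself are illustrative, not a prescribed implementation.

```python
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # assumption: average month, about 43,830 minutes


def error_budget_minutes(slo: float, window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Total downtime allowed by the SLO (as a fraction) over the window."""
    return (1 - slo) * window_minutes


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_minutes: float = MINUTES_PER_MONTH) -> float:
    """Budget left after the downtime already incurred in this window."""
    return error_budget_minutes(slo, window_minutes) - downtime_minutes


def feature_work_allowed(slo: float, downtime_minutes: float) -> bool:
    """Hypothetical policy: ship while budget remains, freeze once it is spent."""
    return remaining_budget_minutes(slo, downtime_minutes) > 0


print(f"budget:    {error_budget_minutes(0.999):.1f} min")
print(f"remaining: {remaining_budget_minutes(0.999, 23.8):.1f} min")
print(f"ship after a 50-minute outage? {feature_work_allowed(0.999, 50.0)}")
```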

This flips the usual dev vs ops fight. Dev wants to ship; ops wants stability. Error budget turns it into math: have budget → ship away. Out of budget → stop shipping, fix stability. Both sides have an aligned incentive.

Interview answer

"We target 99.9% availability. Error budget is 43 minutes/month. We track request-level SLIs and expose a burn-rate dashboard. When we've burned 50% of the budget in 10% of the window, we freeze non-safety deploys."

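The freeze rule in the interview answer can be read as a burn-rate threshold: 50% of the budget in 10% of the window is a burn rate of 5, meaning the budget would be gone in a fifth of the window at that pace. The sketch below assumes that reading; the function names and the generalisation to "any burn rate of 5 or more" are assumptions, not part of the quoted answer.

```python
FREEZE_BURN_RATE = 5.0  # assumption: 50% of budget consumed in 10% of the window


def burn_rate(budget_consumed_fraction: float, window_elapsed_fraction: float) -> float:
    """How fast the budget is being spent relative to an even spend (1.0 = on pace)."""
    if window_elapsed_fraction <= 0:
        raise ValueError("window_elapsed_fraction must be positive")
    return budget_consumed_fraction / window_elapsed_fraction


def should_freeze(budget_consumed_fraction: float, window_elapsed_fraction: float) -> bool:
    """Hypothetical policy: freeze non-safety deploys at or above the threshold."""
    return burn_rate(budget_consumed_fraction, window_elapsed_fraction) >= FREEZE_BURN_RATE


# Day 3 of a 30-day window with 60% of the budget gone.
print(burn_rate(0.6, 3 / 30), should_freeze(0.6, 3 / 30))
# Day 15 with 40% of the budget gone: under an even pace, keep shipping.
print(burn_rate(0.4, 15 / 30), should_freeze(0.4, 15 / 30))
```

A burn rate of 1.0 spends the budget exactly at the end of the window, which is why thresholds well above 1 are used for alerts that need a fast reaction.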
06

What each tier costs

Three nines

Single region, multi-AZ

Standard managed services (RDS, EKS) with multi-AZ failover. Achievable with moderate engineering. Most startups.

Four nines

Multi-region active-passive

Automated failover to another region. Full data replication. Chaos testing. Runbooks for every failure. Medium-sized companies aiming for enterprise SaaS deals.

Five nines

Multi-region active-active

Traffic served from ≥2 regions simultaneously. Conflict resolution for writes. Typically N+2 redundancy at every layer. Payment networks, stock exchanges, emergency services.

Six+ nines

Specialized hardware

Not achievable with commodity servers. Think Stratus fault-tolerant boxes, aerospace. Irrelevant for most interview discussions.

07

Used in problems

Payment gateway and stock exchange problems target four or five nines — the rest typically aim for three. WhatsApp's voice/video must stay above 99.99% regionally.
