SLOs, SLIs, SLAs

01

Why this matters

"Our service is reliable" is not a claim; it's marketing. "Our service is reliable" with numbers — "99.9% of requests complete under 200ms, measured over a 30-day window" — is an engineering target. Without SLOs/SLIs/SLAs, nobody knows if the service is healthy. With them, you have objective thresholds, budgets, and escalation triggers.

Every interview about reliability or on-call goes here. Mix up the three acronyms and you've lost credibility.

02

The three acronyms, unambiguously

SLI — Service Level Indicator

What you measure

A quantitative metric: "% of requests returning 2xx within 200ms." Measured continuously. Raw data. No target, no threshold — just the number.

SLO — Service Level Objective

What you aim for internally

A target on an SLI: "99.9% of requests under 200ms over any 30-day window." Drives engineering priority. Breaching it means freezing feature work and fixing reliability first.

SLA — Service Level Agreement

What you promise customers (legally)

A contract with monetary consequences: "99.5% monthly uptime or 10% credit." Always set looser than the SLO so you have headroom before penalties apply. SLA violations mean refunds or service credits.
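The distinction between the three can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` type, the function names, and the empty-window rule are hypothetical, and the 99.9% / 99.5% targets are the example figures used in this document.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time as seen by the caller


def good_fraction(requests: Sequence[Request], threshold_ms: float = 200.0) -> float:
    """SLI: fraction of requests that returned 2xx in under threshold_ms."""
    if not requests:
        return 1.0  # assumption: an empty window counts as fully good
    good = sum(
        1 for r in requests
        if 200 <= r.status < 300 and r.latency_ms < threshold_ms
    )
    return good / len(requests)


SLO_TARGET = 0.999  # internal objective (example value)
SLA_TARGET = 0.995  # contractual floor, deliberately looser than the SLO


def classify(sli: float) -> str:
    """Compare a measured SLI against the internal and contractual targets."""
    if sli >= SLO_TARGET:
        return "meets SLO"
    if sli >= SLA_TARGET:
        return "SLO breached, SLA intact"
    return "SLA breached"
```

The SLI is just the number returned by `good_fraction`; the SLO and SLA are the two thresholds it gets compared against, with different consequences attached to each.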

03

Picking good SLIs

An SLI should be user-visible — it correlates with what users actually experience. CPU usage is not an SLI; it's a technical metric that may or may not affect users. Latency and error rate are SLIs because a user feels them.

Four SLIs worth tracking for most request-serving systems, following Google's SRE book:

Availability. Requests that succeeded (2xx, 3xx, 4xx-that-are-expected) / total. Don't lump 4xx with 5xx — a user sending bad input isn't a service failure.
Latency. P50, P95, P99 of successful requests. P99 matters because tail latency is user-visible frustration.
Throughput. Requests/sec. Often a target, not a threshold — capacity planning.
Correctness. Harder to quantify. "% of orders processed without manual intervention" — domain-specific.
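As a rough sketch, the availability and latency SLIs above could be computed from raw request data like this. The function names are hypothetical, every 4xx is treated as "expected" for simplicity (a real service would decide this per status code), and the percentile uses the nearest-rank method, which is only one of several common definitions.

```python
import math
from typing import Sequence


def availability(status_codes: Sequence[int]) -> float:
    """Fraction of requests that did not fail server-side.

    Assumption: all 4xx responses are expected client errors, so only
    5xx responses count against availability.
    """
    if not status_codes:
        return 1.0  # assumption: no traffic means nothing failed
    ok = sum(1 for s in status_codes if s < 500)
    return ok / len(status_codes)


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_summary(latencies_ms: Sequence[float]) -> dict:
    """P50 / P95 / P99 of the latencies passed in (successful requests only)."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Filtering to successful requests before calling `latency_summary` keeps fast failures from flattering the latency numbers.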

04

Setting realistic SLOs

The mistake: picking an SLO number that sounds good ("99.99%!") without understanding what it requires. See availability nines: 99.99% allows about 52 minutes of total downtime per year. Every extra nine cuts the allowed downtime by 10× and is usually far more expensive to achieve.
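The arithmetic behind the nines is easy to check. A small sketch, assuming a 365-day year and a hypothetical helper name:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines}% -> {minutes:.2f} minutes/year")
```

Each step down the list shrinks the allowance by a factor of ten, from days at 99% to roughly five minutes at 99.999%.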

The recipe:

Measure current behavior for 30 days → you probably get something like 99.87%.
Set SLO at current level or slightly better (99.9%). Gives engineering credit for where they already are.
Set SLA at 99.5% — comfortably looser than SLO. If SLO breaches, you have time before contractual penalties.
Review SLOs quarterly. If always met, tighten. If always breached, fix reliability or loosen — match reality.

05

Deep dive — error budgets

Google SRE's key insight: 100% reliability is the wrong target. It forbids change — every deploy risks downtime. Instead, derive an error budget from your SLO.

SLO = 99.9% over 30 days → allowed downtime = 0.1% × 30 × 24 × 60 = 43.2 minutes per 30-day window. That's your budget. Spend it on deploys, experiments, infrastructure changes. If you've burned 30 minutes by week 2, you have 13.2 minutes left for the rest of the window, so slow down deploys.
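The same calculation as a short sketch. The function names are hypothetical, and the budget is modeled as minutes of full downtime, as in the example above, rather than as a count of failed requests.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime, in minutes, for an SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget_minutes(slo: float, burned_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after some downtime has already been spent."""
    return error_budget_minutes(slo, window_days) - burned_minutes


if __name__ == "__main__":
    total = error_budget_minutes(0.999)
    left = remaining_budget_minutes(0.999, burned_minutes=30)
    print(f"budget: {total:.1f} min, remaining: {left:.1f} min")
```

A negative result from `remaining_budget_minutes` means the budget is overspent for the current window.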

Error budget policy:

Budget > 50% remaining → ship freely.
Budget < 25% remaining → review risky changes more carefully.
Budget exhausted → freeze feature work, focus on reliability until next window.
Budget consistently under-used (say, only 10% burned most months) → you're over-investing in reliability at the expense of velocity. Ship faster, or tighten the SLO if users actually depend on the higher level.
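A policy like this can be written down as a small decision function so that nobody argues about it during an incident. This is a sketch with hypothetical names. The list above does not say what happens between 25% and 50% remaining, so the behavior for that band is an assumption and is labeled as such in the code.

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining (0.0 to 1.0) to an action."""
    if remaining_fraction <= 0:
        return "freeze feature work"
    if remaining_fraction < 0.25:
        return "review risky changes"
    if remaining_fraction > 0.50:
        return "ship freely"
    # Assumption: the 25%-50% band is not defined by the policy above;
    # here it is treated as "ship, but watch the burn".
    return "ship with caution"


def remaining_fraction(budget_minutes: float, burned_minutes: float) -> float:
    """Fraction of the budget still unspent; can go negative if overspent."""
    return (budget_minutes - burned_minutes) / budget_minutes
```

With the 43.2-minute budget from the earlier example, burning 30 minutes leaves about 31% of the budget, which falls in the undefined middle band.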

This turns reliability into a math conversation between dev and ops, not a personality conflict. "Have we met the SLO? Yes → ship. No → fix."

06

Real-world

Google SRE model

Invented error budgets

Every service has a defined SLO; every team has a budget. The SRE book is the canonical reference.

AWS S3 SLA

99.9% monthly uptime

Below 99.9% → 10% service credit. Below 99% → 25% credit. Standard template.
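The tiered-credit template is simple to express in code. The sketch below encodes only the two tiers quoted above; the function name is hypothetical, and a real SLA document may define further tiers and exclusions that are not modeled here.

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Service credit (as % of the monthly bill) for a given monthly uptime."""
    if monthly_uptime_pct < 99.0:
        return 25
    if monthly_uptime_pct < 99.9:
        return 10
    return 0  # SLA met, no credit owed
```

Checking the lowest tier first keeps each uptime value mapped to exactly one credit level.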

Stripe SLA

Enterprise tier: 99.99%

Higher tiers buy higher SLAs. Underlying system still runs the same, but the penalty scheme scales with customer commitment.

Nobl9 / Honeycomb / Datadog

SLO tooling

Platforms that compute SLIs from your metrics, track error budgets, and alert on burn rate. Much better than hand-rolling dashboards.
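Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows: a burn rate of 1 spends the budget exactly over the window, and anything higher exhausts it early. A minimal sketch of that calculation follows; the function names and the alert threshold are hypothetical and do not reflect how any particular platform implements it.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate


def should_alert(bad_events: int, total_events: int, slo: float,
                 threshold: float = 10.0) -> bool:
    """Fire when the budget is burning `threshold` times faster than sustainable.

    The default threshold is an illustrative value, not a recommendation.
    """
    return burn_rate(bad_events, total_events, slo) >= threshold
```

Alerting on burn rate rather than on raw error counts ties the alert directly to how quickly the error budget is being consumed.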

07

Used in problems

A distributed logging platform has strict availability SLOs (people depend on it when everything else is broken). A payment gateway publishes SLAs for enterprise tiers. A rate limiter's SLI is the % of correct limit decisions (false positives and false negatives both count).
