05
Deep dive — error budgets
Google SRE's key insight: 100% reliability is the wrong target. It effectively forbids change, because every deploy risks downtime. Instead, set an SLO below 100% and derive an error budget from it.
SLO = 99.9% over 30 days → allowed downtime = 0.1% × 30 × 24 × 60 minutes = 43.2 minutes per 30-day window. That's your budget. Spend it on deploys, experiments, and infrastructure changes. If you've burned 30 minutes by the end of week 2, you have about 13 minutes left for the rest of the window: slow down deploys.
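The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, assuming a time-based availability SLO. The function name `error_budget_minutes` and its parameters are hypothetical, not part of any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo is a fraction (0.999 means 99.9%); window_days is the SLO window.
    Both names are illustrative assumptions.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


budget = error_budget_minutes(0.999)   # 99.9% over 30 days
burned = 30.0                          # minutes of downtime so far
remaining = budget - burned

print(f"budget:    {budget:.1f} min")
print(f"remaining: {remaining:.1f} min")
```

With the numbers from the example, this reports a budget of 43.2 minutes and 13.2 minutes remaining after 30 minutes of downtime.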
Error budget policy:
- Budget > 50% remaining → ship freely.
- Budget < 25% remaining → review risky changes more carefully.
- Budget exhausted → freeze feature work, focus on reliability until next window.
- Budget consistently under-used (say, only 10% burned most months) → you're over-engineering reliability at the expense of velocity. Loosen the SLO or ship faster.
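The first three tiers of the policy can be expressed as a small decision function. This is a sketch only: `policy_action` and its return strings are hypothetical names. The policy above does not say what to do between 25% and 50% remaining, so the sketch reports that band as unspecified rather than guessing. The "consistently under-used" rule looks at several months of history and is left out here.

```python
def policy_action(budget_minutes: float, burned_minutes: float) -> str:
    """Map the remaining error budget to a policy tier.

    Thresholds (50%, 25%, exhausted) come from the policy list above.
    The function name and return strings are illustrative assumptions.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining_fraction = (budget_minutes - burned_minutes) / budget_minutes

    if remaining_fraction <= 0:
        return "freeze feature work"
    if remaining_fraction < 0.25:
        return "review risky changes"
    if remaining_fraction > 0.50:
        return "ship freely"
    # 25% to 50% remaining: the policy above does not define this band.
    return "unspecified"


for burned in (10.0, 30.0, 35.0, 43.2):
    print(burned, "->", policy_action(43.2, burned))
```

Returning a tier name instead of acting on it keeps the decision a plain function of two numbers, which fits the idea of making reliability a calculation rather than a judgment call.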
This turns reliability into a math conversation between dev and ops, not a personality conflict. "Is there budget left? Yes → ship. No → fix."