05
Deep dive — error budgets
Google SRE's key insight: 100% availability is the wrong target. It forbids change — you can't ship new code if any deploy risks downtime. Instead, your SLO implies an error budget — the amount of downtime you're allowed.
If your SLO is 99.9% monthly, your error budget is about 43.8 minutes/month (0.1% of an average-length month; 43.2 minutes for a 30-day month). If you burn through it in week 1 (major outage), you freeze feature work for the next 3 weeks and focus entirely on reliability. If you enter week 4 with 20 minutes unused, you can afford to ship a riskier change; consistently leaving budget unspent means you're moving too slowly.
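The arithmetic behind those numbers can be sketched in a few lines. This is a minimal illustration, not a standard library: the function names `error_budget_minutes` and `remaining_budget_minutes` are made up for this example, and it assumes a time-based availability SLO expressed as a fraction (0.999 for 99.9%).

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    Hypothetical helper: slo is a fraction such as 0.999, window_days is the
    length of the SLO window in days.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window


def remaining_budget_minutes(slo: float, window_days: float,
                             downtime_minutes: float) -> float:
    """Budget left after subtracting downtime already incurred (may go negative)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# 99.9% over a 30-day month: roughly 43.2 minutes.
budget_30d = error_budget_minutes(0.999, 30)

# 99.9% over an average-length month (365.25 / 12 days): roughly 43.8 minutes.
budget_avg = error_budget_minutes(0.999, 365.25 / 12)

# After a 25-minute outage in a 30-day month, roughly 18 minutes remain.
left = remaining_budget_minutes(0.999, 30, 25)
```

A negative result from `remaining_budget_minutes` corresponds to the "out of budget" case: the SLO for the window is already missed.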
This flips the usual dev-vs-ops fight. Dev wants to ship; ops wants stability. The error budget turns it into math: have budget → ship away. Out of budget → stop shipping, fix stability. Both sides now share the same incentive.
Interview answer
"We target 99.9% availability. That gives us an error budget of about 43 minutes/month. We track request-level SLIs and expose a burn-rate dashboard. When we've burned 50% of the budget in 10% of the window (a 5x burn rate), we freeze non-safety deploys."
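The freeze rule in that answer can be sketched as a burn-rate check on request-level counts. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names are hypothetical, the failed and total request counts are assumed to be measured over the most recent 10% of the SLO window, and traffic is assumed to be roughly uniform across the window. Burn rate here means the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the full window.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    allowed_error_ratio = 1.0 - slo
    return (failed / total) / allowed_error_ratio


def budget_consumed(failed: int, total: int, slo: float,
                    window_fraction: float) -> float:
    """Fraction of the whole window's budget spent during the measured slice.

    Assumes roughly uniform traffic, so a slice covering window_fraction of
    the window at burn rate r spends r * window_fraction of the budget.
    """
    return burn_rate(failed, total, slo) * window_fraction


def should_freeze(failed: int, total: int, slo: float,
                  window_fraction: float = 0.10,
                  budget_threshold: float = 0.50) -> bool:
    """True when the slice spent at least budget_threshold of the budget."""
    return budget_consumed(failed, total, slo, window_fraction) >= budget_threshold


# 6,000 failures out of 1,000,000 requests against a 99.9% SLO is a burn
# rate of about 6, which spends about 60% of the budget in 10% of the window.
freeze_now = should_freeze(failed=6_000, total=1_000_000, slo=0.999)

# 1,000 failures out of 1,000,000 is a burn rate of about 1: on pace, no freeze.
on_pace = should_freeze(failed=1_000, total=1_000_000, slo=0.999)
```

The defaults encode the thresholds quoted in the answer (50% of the budget, 10% of the window); a real alerting setup would typically tune both and evaluate more than one window length.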