04
Canary in detail
Deploy the new version to a small subset of instances (typically 1-5%). The remaining instances stay on the old version. The router (a load balancer or service mesh) splits traffic between the two by percentage.
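One way a router could implement the percentage split is sketched below. This is an illustration, not how any particular load balancer or mesh does it: the function name `route`, the 10,000-bucket scheme, and the use of a request or user ID as the hash key are all assumptions. Hashing a stable key keeps the same caller on the same version for as long as the percentage is unchanged.

```python
import hashlib


def route(request_key: str, canary_percent: float) -> str:
    """Return "canary" or "stable" for a request (hypothetical helper).

    request_key is assumed to be something stable, such as a user or session ID.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to one of 10,000 buckets, giving 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"


if __name__ == "__main__":
    sample = [route(f"user-{i}", 5.0) for i in range(100_000)]
    share = sample.count("canary") / len(sample)
    print(f"canary share: {share:.2%}")
```

Real routers often split at random per request instead. The hashed variant trades a perfectly even split for a consistent experience per user.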
Watch metrics on the canary instances:
- Error rate per response code.
- P50/P95/P99 latency.
- Business metrics — conversion rate, signup rate, whatever matters.
- Compare every metric on the canary against the baseline (instances on the old version) over the same time window. Anomalies → revert.
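The canary-versus-baseline comparison can be sketched as a small verdict function. Everything here is assumed for illustration: the `Snapshot` type, the `canary_verdict` name, and the thresholds (1 percentage point of extra error rate, 20% extra P99 latency, a 500-request minimum before judging). Real thresholds depend on the service and its traffic.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics for one group of instances over one time window (hypothetical)."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Snapshot,
    canary: Snapshot,
    max_error_rate_delta: float = 0.01,  # assumed threshold
    max_latency_ratio: float = 1.2,      # assumed threshold
    min_requests: int = 500,             # assumed minimum sample size
) -> str:
    """Return "wait", "revert", or "healthy"."""
    if canary.requests < min_requests:
        return "wait"  # too little canary traffic to judge either way
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "revert"
    if (baseline.p99_latency_ms > 0
            and canary.p99_latency_ms / baseline.p99_latency_ms > max_latency_ratio):
        return "revert"
    return "healthy"


if __name__ == "__main__":
    baseline = Snapshot(requests=100_000, errors=100, p99_latency_ms=200.0)
    canary = Snapshot(requests=2_000, errors=100, p99_latency_ms=210.0)
    print(canary_verdict(baseline, canary))
```

Comparing against the live baseline rather than a fixed number means a site-wide incident that hurts both versions equally does not get blamed on the canary.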
If healthy after a hold period (15 minutes to 1 hour), promote: bump canary to 10%, watch, then 25%, then 50%, then 100%. Each step is a checkpoint with automated guardrails. Tools such as Spinnaker, Argo Rollouts, and Flagger can automate the whole ladder.
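The promotion ladder reduces to a short loop. The sketch below is not how Spinnaker, Argo Rollouts, or Flagger implement it. The function `run_ladder` and its three callbacks (`set_weight`, `hold`, `is_healthy`) are hypothetical stand-ins for whatever your router and metrics system expose.

```python
from typing import Callable, Sequence


def run_ladder(
    set_weight: Callable[[int], None],  # hypothetical: sets canary traffic percent
    hold: Callable[[], None],           # hypothetical: waits out the hold period
    is_healthy: Callable[[], bool],     # hypothetical: the automated guardrail check
    steps: Sequence[int] = (1, 10, 25, 50, 100),
) -> bool:
    """Walk the canary up the ladder; revert to 0% on the first failed check."""
    for percent in steps:
        set_weight(percent)
        hold()
        if not is_healthy():
            set_weight(0)  # revert: send all traffic back to the old version
            return False
    return True


if __name__ == "__main__":
    weights = []
    ok = run_ladder(weights.append, hold=lambda: None, is_healthy=lambda: True)
    print(ok, weights)
```

In a real rollout `hold` would block for the hold period and `is_healthy` would query the metrics listed above. Passing them in as callbacks keeps the ladder logic testable without either.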
Why canary beats blue-green for most services
Blue-green is binary: all traffic moves to the new version or none does. Canary lets you discover problems with a 1% blast radius instead of 100%. The bug that only shows up under real production load? Canary catches it before it bites everyone.