Concept · Reliability

Blue-Green & Canary Deployments

01

Why this matters

You ship 100 deploys a week, and each one risks breaking production. The naive approach of "stop the old version, start the new version" costs about 30 seconds of total outage per deploy, and its rollback takes longer than the original deploy. Deployment strategies (rolling, blue-green, canary) let you ship continuously without downtime and roll back quickly; blue-green makes rollback near-instant. They pair with feature flags, which control release at the application layer.

02

The four strategies

| Strategy | How it works | Rollback | Cost |
| --- | --- | --- | --- |
| Recreate | Stop old, start new. Downtime during deploy. | Re-run with old image | Free; bad UX |
| Rolling | Replace instances N at a time. Old + new run side by side briefly. | Roll back the same way (slow) | Cheap; default in K8s |
| Blue-green | Two full environments. Switch traffic atomically. | Switch traffic back instantly | 2× infra during deploy |
| Canary | Deploy to a small slice (1%). Watch metrics. Promote or revert. | Revert before promotion | Minimal extra infra |

Figure: Traffic mix over time, old (gray) vs new (green). Recreate: about 30 s of downtime. Rolling: gradual swap, no downtime. Blue-green: 2× infra during deploy, instant flip. Canary: 1% → 10% → 50% → 100% with metric checks.

  • 2×: infra cost during blue-green
  • +5-10%: infra cost during canary
  • ~5 sec: blue-green flip latency
  • ~2 hr: typical canary 1%→100% duration

03

Blue-green in detail

You run two environments behind a router (LB, DNS, service mesh):

  • Blue — current production. Receiving 100% traffic.
  • Green — new version. Deployed but no traffic yet. Smoke tested.

Cut over: flip the router from blue → green. All new requests go to green. Done.

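Rollback = flip back. Near-instant, because the old environment is still running, just not serving. (With DNS as the router, the flip is only as fast as record TTLs and client caches allow.)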

Cost: 2× infrastructure during the deploy window, which is often kept open for hours so the old environment stays available as a fallback. You trade money for instant rollback. Used for high-stakes services where rollback speed matters most.
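The flip can be pictured as a single pointer swap. The sketch below is a toy in-process model, not a real load balancer: `BlueGreenRouter`, `blue_v1`, and `green_v2` are hypothetical names, and a production router would be an LB, DNS record, or mesh rule rather than a Python object.

```python
import threading


class BlueGreenRouter:
    """Toy router: sends every request to whichever environment is live."""

    def __init__(self, blue, green):
        self._envs = {"blue": blue, "green": green}
        self._live = "blue"            # blue starts as production
        self._lock = threading.Lock()

    @property
    def live(self):
        return self._live

    def handle(self, request):
        # Pick the backend under the lock so a concurrent flip cannot
        # change the target while we are choosing it.
        with self._lock:
            backend = self._envs[self._live]
        return backend(request)

    def flip(self):
        """Cut over (or roll back): swap which environment is live."""
        with self._lock:
            self._live = "green" if self._live == "blue" else "blue"
        return self._live


def blue_v1(request):
    # Hypothetical stand-in for the current production version.
    return f"v1:{request}"


def green_v2(request):
    # Hypothetical stand-in for the new version.
    return f"v2:{request}"


router = BlueGreenRouter(blue_v1, green_v2)
before = router.handle("GET /cart")       # served by blue
router.flip()                             # cut over to green
after = router.handle("GET /cart")        # served by green
router.flip()                             # rollback is the same operation
rolled_back = router.handle("GET /cart")  # served by blue again
```

Because both environments stay up, cut-over and rollback are the same cheap operation; nothing is redeployed in either direction.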

04

Canary in detail

Deploy the new version to a small subset of instances (typically 1-5%). Other instances stay on the old version. The router (LB, mesh) splits traffic by percentage.
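One way to picture the percentage split is hash-based bucketing. This is a minimal sketch under assumptions: the `bucket` and `route` helpers are hypothetical, and keying on a user ID to keep each user pinned to one version is a design choice made here, not something every LB or mesh does (many simply pick a backend at random per request according to the weights).

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key (for example a user ID) to a stable bucket in 0..9999."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def route(request_key: str, canary_percent: float) -> str:
    """Return "canary" for roughly canary_percent of keys, else "baseline"."""
    threshold = round(canary_percent * 100)  # 1% -> 100 of 10,000 buckets
    return "canary" if bucket(request_key) < threshold else "baseline"


# Rough check of the split over synthetic keys.
keys = [f"user-{i}" for i in range(100_000)]
share = sum(route(k, 5.0) == "canary" for k in keys) / len(keys)
print(f"canary share at 5%: {share:.2%}")
```

Because the bucket depends only on the key, raising the percentage only moves more keys onto the canary; a key that was already on the canary at 5% is still on it at 10%.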

Watch metrics on the canary instances:

  • Error rate per response code.
  • P50/P95/P99 latency.
  • Business metrics — conversion rate, signup rate, whatever matters.
  • Compare canary vs baseline. Anomalies → revert.

If healthy after a hold period (15 min - 1 hour), promote: bump canary to 10%, watch, then 25%, then 50%, then 100%. Each step is a checkpoint with automated guardrails. Spinnaker, Argo Rollouts, Flagger automate the whole ladder.

Why canary beats blue-green for most services

Blue-green is binary — all or none. Canary lets you discover problems with 1% blast radius instead of 100%. The bug that only shows up under real production load? Canary catches it before it bites everyone.

05

Deep dive — automated canary analysis

Modern canary platforms (Flagger, Argo Rollouts) do automated promotion + rollback based on SLO compliance. The pattern:

  1. Operator pushes new version with deploy spec including SLO checks.
  2. Platform deploys to 5% of capacity.
  3. Every 30s, platform queries Prometheus: "P99 latency on canary vs baseline within 10%?" "Error rate on canary < baseline + 0.1%?"
  4. If checks pass for 10 minutes → promote to 25%.
  5. Repeat at 50%, 100%.
  6. Any check fails → automatic revert.
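The loop above can be sketched as follows. This is a simplified model, not how Flagger or Argo Rollouts are implemented: `Metrics`, `checks_pass`, `run_rollout`, and the `fetch_metrics` callback are hypothetical names, the callback stands in for the Prometheus queries, and the 30-second wait between checks is omitted so the example runs instantly.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p99_latency_ms: float
    error_rate: float  # fraction of requests, e.g. 0.002 = 0.2%


def checks_pass(canary: Metrics, baseline: Metrics) -> bool:
    # P99 latency on canary within 10% of baseline.
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * 1.10
    # Error rate on canary below baseline + 0.1 percentage points.
    errors_ok = canary.error_rate < baseline.error_rate + 0.001
    return latency_ok and errors_ok


def run_rollout(fetch_metrics, steps=(5, 25, 50, 100), checks_per_step=20):
    """Walk the ladder; return (final_percent, outcome).

    fetch_metrics(percent) is an assumed callback returning a
    (canary, baseline) pair of Metrics. 20 checks at one per 30 s
    corresponds to the 10-minute hold at each step.
    """
    for percent in steps:
        for _ in range(checks_per_step):
            canary, baseline = fetch_metrics(percent)
            if not checks_pass(canary, baseline):
                return 0, "reverted"  # any failed check reverts everything
    return steps[-1], "promoted"


baseline = Metrics(p99_latency_ms=200.0, error_rate=0.002)


def healthy(percent):
    return Metrics(210.0, 0.002), baseline


def slow_at_25(percent):
    # Simulated regression that only appears once load reaches 25%.
    p99 = 260.0 if percent >= 25 else 205.0
    return Metrics(p99, 0.002), baseline


print(run_rollout(healthy))
print(run_rollout(slow_at_25))
```

The second scenario shows why the ladder has several rungs: the regression passes at 5% and is only caught, and reverted, at 25%.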

Operator never touches the deploy after pushing. SLO-violating changes self-revert before anyone is paged. Confidence in deploys goes from "fingers crossed" to "trust the system." Netflix and Lyft both run this pattern.

The catch: requires good SLO definitions (see SLIs/SLOs) and metric instrumentation. Garbage-in metrics = false confidence in your canary analysis.

06

Compatibility constraints

Every strategy except recreate has one painful invariant: old + new must coexist briefly. So new versions must be backwards-compatible with old versions of:

  • Database schema (don't drop a column the old code reads)
  • API contracts (don't break old clients hitting your service)
  • Message formats (old workers may still consume)

Common pattern: expand-then-contract migrations. Add a new column / endpoint / message field. Old code ignores it. Deploy new code that uses it. Once 100% on new, remove the old. Two deploys instead of one, but always backwards-compatible during transition.
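A small sketch of the expand phase, using an in-memory SQLite database. The `users` table, the `display_name` column, and the `old_*` / `new_*` functions are all hypothetical; the point is only that old and new code can run against the same schema during the rollout.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")


def old_read(conn):
    # Old code: only knows about `name`.
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def old_write(conn, name):
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


# Expand: add a nullable column. Old reads and writes are unaffected.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
old_write(db, "Grace Hopper")  # old instances keep working mid-rollout


def new_read(conn):
    # New code: prefer the new column, fall back to the old one.
    return conn.execute(
        "SELECT id, COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()


def new_write(conn, name):
    # New code writes both columns so old readers still see the row.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name)
    )


new_write(db, "Barbara Liskov")
rows_old = old_read(db)
rows_new = new_read(db)
print(rows_old == rows_new)

# Contract (a later, separate deploy): once no old code remains, backfill
# display_name and stop reading and writing `name`, then drop it.
```

The contract step is deliberately left as a comment: it belongs in a second deploy, after traffic is 100% on the new version.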

07

Real-world

Netflix Spinnaker

Multi-cloud canary

Deploy to AWS + GCE simultaneously with canary analysis. Open-source; battle-tested at Netflix scale.

Argo Rollouts / Flagger

K8s-native

CRDs that turn standard deployments into canary or blue-green with automated promotion. The new default in K8s shops.

AWS CodeDeploy

Lambda + ECS canary

Built-in canary support for Lambda (for example, shift 10% of traffic, then 100%) and ECS services. Part of the standard AWS deploy tooling.

Istio / Linkerd traffic shifting

Service mesh canary

Mesh handles the percentage routing. Canary is just a YAML change. Combine with feature flags for full progressive delivery.

08

Used in problems

News feed deploys ranker changes via canary with auto-promotion. E-commerce uses blue-green for checkout — instant rollback critical near peak shopping. Payment gateway is canary-only (small slice for hours, monitor everything). Notification system uses rolling for back-end workers, canary for client-facing APIs.
