Blue-Green & Canary Deployments

01

Why this matters

You ship 100 deploys a week, and each one risks breaking production. The naive approach, "stop the old version, start the new version," gives you roughly 30 seconds of total outage per deploy and a rollback that takes longer than the original deploy. Deployment strategies such as rolling, blue-green, and canary let you ship continuously with zero downtime and fast rollback. They pair with feature flags, which handle the application-layer release dimension.

02

The four strategies

Strategy	How it works	Rollback	Cost
Recreate	Stop old, start new. Downtime during deploy.	Re-run with old image	Free; bad UX
Rolling	Replace instances N at a time. Old + new run side by side briefly.	Roll back the same way (slow)	Cheap; default in K8s
Blue-green	Two full environments. Switch traffic atomically.	Switch traffic back instantly	2× infra during deploy
Canary	Deploy to a small slice (1%). Watch metrics. Promote or revert.	Revert before promotion	Minimal extra infra
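To make the "old + new run side by side" row concrete, here is a minimal sketch of a rolling update over an in-memory fleet. The function name `rolling_update` and the version labels are illustrative assumptions, not part of any orchestrator's API; a real rollout would also wait for health checks between batches.

```python
from typing import Iterator, List


def rolling_update(instances: List[str], new_version: str, batch_size: int) -> Iterator[List[str]]:
    """Replace instances batch_size at a time, yielding the fleet after each batch."""
    fleet = list(instances)
    for start in range(0, len(fleet), batch_size):
        end = min(start + batch_size, len(fleet))
        for i in range(start, end):
            fleet[i] = new_version  # hypothetical stand-in for "stop old, start new" on one instance
        yield list(fleet)


# Four instances on v1, replaced two at a time.
steps = list(rolling_update(["v1"] * 4, "v2", batch_size=2))
for fleet in steps:
    print(fleet)
```

After the first batch the fleet is half old and half new, which is exactly the mixed state the compatibility rules later in this page have to cover.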

[Figure: Traffic mix over time, old (gray) vs. new (green)]

Infra cost during blue-green: 2×
Infra cost during canary: +5-10%
Blue-green flip latency: ~5 sec
Typical canary 1%→100% duration: ~2 hr

03

Blue-green in detail

You run two environments behind a router (LB, DNS, service mesh):

Blue — current production. Receiving 100% traffic.
Green — new version. Deployed but no traffic yet. Smoke tested.

Cut over: flip the router from blue → green. All new requests go to green. Done.

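Rollback = flip back. It is near-instant because the old environment is still running, just not serving. One caveat: with DNS-based switching, clients keep using cached records until the TTL expires, so the flip is only as fast as the TTL allows.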

Cost: 2× infrastructure during the deploy window (often hours long, for safety). You trade money for instant rollback. Best suited to high-stakes services where rollback speed matters most.
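The cut-over and rollback described in this section can be sketched as a router that holds a single "live" pointer. This is an in-process toy under stated assumptions: the class `BlueGreenRouter` and its method names are hypothetical, and the two backends are plain callables standing in for full environments.

```python
import threading
from typing import Callable, Dict


class BlueGreenRouter:
    """Routes every request to whichever environment is currently live."""

    def __init__(self, blue: Callable[[str], str], green: Callable[[str], str]) -> None:
        self._backends: Dict[str, Callable[[str], str]] = {"blue": blue, "green": green}
        self._live = "blue"
        self._lock = threading.Lock()

    def route(self, request: str) -> str:
        # Read the pointer under the lock so a request sees either blue or green, never a mix.
        with self._lock:
            handler = self._backends[self._live]
        return handler(request)

    def cut_over(self) -> str:
        """Flip the live pointer to the other environment and return its name."""
        with self._lock:
            self._live = "green" if self._live == "blue" else "blue"
            return self._live

    # Rolling back is the same flip in the opposite direction.
    rollback = cut_over


router = BlueGreenRouter(blue=lambda r: f"v1:{r}", green=lambda r: f"v2:{r}")
before = router.route("GET /")
router.cut_over()
after = router.route("GET /")
router.rollback()
restored = router.route("GET /")
```

Because both environments stay up, the flip changes only one pointer, which is why rollback costs the same as cut-over.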

04

Canary in detail

Deploy the new version to a small subset of instances (typically 1-5%). Other instances stay on the old version. The router (LB, mesh) splits traffic by percentage.
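One way a router might split traffic by percentage is to hash a stable request or user identifier into a bucket. The sketch below assumes that approach; `pick_version` and the identifier format are illustrative, and real load balancers and meshes expose this as weighted routing configuration rather than application code.

```python
import hashlib


def pick_version(request_id: str, canary_percent: float) -> str:
    """Map an identifier to 'canary' or 'stable' so that about canary_percent% go to canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, i.e. 0.01% granularity
    return "canary" if bucket < canary_percent * 100 else "stable"


# The same identifier always lands on the same side, so a user does not flip between versions.
sample = [pick_version(f"user-{i}", canary_percent=5) for i in range(20_000)]
canary_share = sample.count("canary") / len(sample)
print(f"canary share: {canary_share:.3f}")
```

Hashing makes assignment sticky; picking a random number per request would also hit the target percentage but would bounce individual users between versions.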

Watch metrics on the canary instances:

Error rate per response code.
P50/P95/P99 latency.
Business metrics — conversion rate, signup rate, whatever matters.

Compare the canary against a baseline (the instances still on the old version). Anomalies → revert.
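As a rough illustration of the canary-versus-baseline comparison, the sketch below computes nearest-rank percentiles and an error rate for each group and lists the metrics where the canary is worse. The function names are hypothetical, and the default tolerances (10% on latency, 0.1 percentage points on error rate) are assumptions borrowed from the example checks in the automated-analysis section.

```python
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integer math
    return ordered[rank - 1]


def summarize(latencies_ms: List[float], errors: int) -> Dict[str, float]:
    """Summarize one group of requests (one latency sample per request)."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "error_rate": errors / len(latencies_ms),
    }


def anomalies(canary: Dict[str, float], baseline: Dict[str, float],
              latency_tolerance: float = 1.10, error_margin: float = 0.001) -> List[str]:
    """Return the names of metrics where the canary exceeds the baseline beyond tolerance."""
    bad = [k for k in ("p50", "p95", "p99") if canary[k] > baseline[k] * latency_tolerance]
    if canary["error_rate"] > baseline["error_rate"] + error_margin:
        bad.append("error_rate")
    return bad


baseline = summarize([float(x) for x in range(1, 101)], errors=1)
canary = summarize([x * 1.5 for x in range(1, 101)], errors=3)
regressions = anomalies(canary, baseline)  # any entry here would be a reason to revert
```

Business metrics such as conversion rate would need their own comparison, usually over a longer window, since they are noisier than latency.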

If the canary is healthy after a hold period (15 minutes to 1 hour), promote: bump canary to 10%, watch, then 25%, then 50%, then 100%. Each step is a checkpoint with automated guardrails. Spinnaker, Argo Rollouts, and Flagger can automate the whole ladder.

Why canary beats blue-green for most services

Blue-green is binary — all or none. Canary lets you discover problems with 1% blast radius instead of 100%. The bug that only shows up under real production load? Canary catches it before it bites everyone.

05

Deep dive — automated canary analysis

Modern canary platforms (Flagger, Argo Rollouts) do automated promotion + rollback based on SLO compliance. The pattern:

Operator pushes new version with deploy spec including SLO checks.
Platform deploys to 5% of capacity.
Every 30s, platform queries Prometheus: "P99 latency on canary vs baseline within 10%?" "Error rate on canary < baseline + 0.1%?"
If checks pass for 10 minutes → promote to 25%.
Repeat at 50%, 100%.
Any check fails → automatic revert.

Operator never touches the deploy after pushing. SLO-violating changes self-revert before anyone is paged. Confidence in deploys goes from "fingers crossed" to "trust the system." Netflix and Lyft both run this pattern.

The catch: requires good SLO definitions (see SLIs/SLOs) and metric instrumentation. Garbage-in metrics = false confidence in your canary analysis.
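The promotion loop described in this section can be sketched as follows. This is not how Flagger or Argo Rollouts are implemented; `run_canary`, `fetch`, and `set_weight` are hypothetical names, `fetch` stands in for the Prometheus queries, and the metric keys are assumptions. The thresholds and timings mirror the example numbers above.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

Metrics = Dict[str, float]  # assumed keys: "p99_ms" and "error_rate"


def checks_pass(canary: Metrics, baseline: Metrics) -> bool:
    """P99 within 10% of baseline, and error rate below baseline + 0.1 percentage points."""
    latency_ok = canary["p99_ms"] <= baseline["p99_ms"] * 1.10
    errors_ok = canary["error_rate"] < baseline["error_rate"] + 0.001
    return latency_ok and errors_ok


def run_canary(
    fetch: Callable[[int], Tuple[Metrics, Metrics]],  # hypothetical: returns (canary, baseline)
    set_weight: Callable[[int], None],                # hypothetical: sets canary traffic percent
    steps: Sequence[int] = (5, 25, 50, 100),
    checks_per_step: int = 20,                        # 10 minutes at one check every 30 s
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for percent in steps:
        set_weight(percent)
        for _ in range(checks_per_step):
            sleep(interval_s)
            canary, baseline = fetch(percent)
            if not checks_pass(canary, baseline):
                set_weight(0)  # automatic revert: send all traffic back to the old version
                return f"reverted at {percent}%"
    return "promoted"


def fake_fetch(percent: int) -> Tuple[Metrics, Metrics]:
    """Simulated metrics: the canary's P99 degrades once it carries 50% of traffic."""
    baseline = {"p99_ms": 100.0, "error_rate": 0.010}
    p99 = 150.0 if percent >= 50 else 105.0
    return {"p99_ms": p99, "error_rate": 0.010}, baseline


weights: List[int] = []
result = run_canary(fake_fetch, weights.append, sleep=lambda s: None)  # no-op sleep for the demo
```

Injecting `sleep` and `fetch` keeps the loop testable without a metrics backend; in the simulated run the regression only appears under higher load, which is the case the stepwise ladder exists to catch.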

06

Compatibility constraints

All these strategies have one painful invariant: old + new must coexist briefly. So new versions must be backwards-compatible with old versions of:

Database schema (don't drop a column the old code reads)
API contracts (don't break old clients hitting your service)
Message formats (old workers may still consume)

Common pattern: expand-then-contract migrations. First expand: add the new column, endpoint, or message field, which old code ignores. Then deploy new code that uses it. Once 100% of traffic is on the new code, contract: remove the old one. That is at least two deploys instead of one, but every intermediate state is backwards-compatible.
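Here is a small sketch of the expand phase for a database column, using an in-memory SQLite database. The table `users` and the columns `name` and `display_name` are made up for illustration; the point is that old and new code both keep working against the expanded schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")


def old_read(db: sqlite3.Connection, user_id: int) -> str:
    """Old code: only knows about the `name` column."""
    return db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()[0]


# Expand: add the new column as nullable so old writers are unaffected, then backfill.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")


def new_write(db: sqlite3.Connection, name: str) -> int:
    """New code: writes both columns so old readers still see the value."""
    cur = db.execute("INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name))
    return cur.lastrowid


def new_read(db: sqlite3.Connection, user_id: int) -> str:
    """New code: prefers the new column, falls back to the old one for rows written by old code."""
    row = db.execute(
        "SELECT COALESCE(display_name, name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


grace_id = new_write(conn, "Grace Hopper")
# An old instance that is still running writes only the old column.
alan_id = conn.execute("INSERT INTO users (name) VALUES ('Alan Turing')").lastrowid

# Contract (a later deploy, once no old code is running): stop writing `name`, then drop it.
```

The dual write and the `COALESCE` fallback are what let both versions run against one schema; they are removed in the contract step.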

07

Real-world

Netflix Spinnaker

Multi-cloud canary

Deploy to AWS + GCE simultaneously with canary analysis. Open-source; battle-tested at Netflix scale.

Argo Rollouts / Flagger

K8s-native

CRDs that turn standard deployments into canary or blue-green with automated promotion. The new default in K8s shops.

AWS CodeDeploy

Lambda + ECS canary

Built-in canary traffic shifting for Lambda (for example, 10% first, then the remaining 90%) and ECS services. Part of the native AWS deploy tooling.

Istio / Linkerd traffic shifting

Service mesh canary

Mesh handles the percentage routing. Canary is just a YAML change. Combine with feature flags for full progressive delivery.

08

Used in problems

The news feed deploys ranker changes via canary with auto-promotion. E-commerce uses blue-green for checkout, because instant rollback is critical near peak shopping. The payment gateway is canary-only (a small slice for hours, with everything monitored). The notification system uses rolling for back-end workers and canary for client-facing APIs.

