Concept · Reliability

Chaos Engineering

01

Why this matters

You think your system handles a database failover. The runbook says it does. The diagram says it does. But you've never actually tested it. The first time it happens for real — at 3am, with no warning — you discover the standby has stale data, the connection pool doesn't reconnect, and the load balancer takes 90s to notice. Every distributed system carries this kind of silent fragility.

Chaos engineering is the practice of deliberately injecting failures in production (or production-like environments) to find these gaps before they find you. Netflix pioneered it with Chaos Monkey; it is now common practice at many large production organizations.

02

The principle

You cannot prove a system is resilient by reasoning about it. You can only prove it by breaking it. Once a week, randomly kill a server. Once a month, simulate a region outage. Did the system survive? If yes, your resilience is real. If no, you fix the gap during business hours instead of at 3am.

Counter-intuitively, the safer way to operate is to continuously create small failures so the system is always being exercised against them. Failures that happen weekly are well-understood; failures that happen yearly are catastrophes.

03

The four-step loop

  1. Hypothesize a steady state. "Latency P99 stays under 200ms; error rate stays under 0.1%."
  2. Inject failure. Kill a node. Block network to a region. Inject 500ms latency on a service-to-service call. Simulate disk-full.
  3. Observe. Did the steady-state hold? Did circuit breakers trip? Did the fallback engage?
  4. Fix any gap. The system should have absorbed this. If it didn't, that's a bug — file it, fix it, rerun the experiment to verify.
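
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `Metrics`, `steady_state_ok`, and `run_experiment` are hypothetical names, and the `read_metrics`, `inject`, and `rollback` callables stand in for whatever monitoring and fault-injection hooks your platform provides. The thresholds come from the example hypothesis in step 1.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Metrics:
    p99_latency_ms: float
    error_rate: float  # fraction, e.g. 0.001 == 0.1%


def steady_state_ok(m: Metrics, max_p99_ms: float = 200.0,
                    max_error_rate: float = 0.001) -> bool:
    """Step 1: the hypothesis, expressed as a check on observed metrics."""
    return m.p99_latency_ms < max_p99_ms and m.error_rate < max_error_rate


def run_experiment(
    read_metrics: Callable[[], Metrics],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one pass of the loop and report the outcome."""
    # An unhealthy baseline makes the result meaningless, so skip the run.
    if not steady_state_ok(read_metrics()):
        return "skipped: baseline unhealthy"
    inject()  # Step 2: inject the failure
    try:
        held = steady_state_ok(read_metrics())  # Step 3: observe
    finally:
        rollback()  # always undo the fault, even if observation raises
    # Step 4: a broken hypothesis is a bug to file, fix, and re-test.
    return "passed" if held else "gap found: file a bug, fix, rerun"
```

Checking the baseline before injecting anything is a deliberate choice here: if the system is already outside its steady state, the experiment cannot tell you whether the fault caused the breach.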

04

Netflix's Simian Army (and successors)

Tool             | What it does                                                         | What it teaches
Chaos Monkey     | Randomly terminates instances during business hours                  | "Can my service survive a single node failure?"
Chaos Gorilla    | Simulates an availability-zone outage                                | "Does my service auto-failover to other AZs?"
Chaos Kong       | Simulates a full region outage                                       | "Does my multi-region failover plan actually work?"
Latency Monkey   | Adds artificial latency to service calls                             | "Are my timeouts tight enough? Are circuit breakers wired?"
Conformity Monkey| Detects instances not matching standards (no health endpoint, etc.)  | "Is my fleet uniform enough to recover automatically?"
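
To make the latency row concrete, here is a small sketch of the idea behind latency injection: wrap a call so it sleeps first, then check whether the caller's timeout and fallback actually engage. This is not Netflix's implementation. `with_injected_latency`, `call_with_timeout`, and `fetch_recommendations` are hypothetical names, and a real setup would more likely inject the delay at the network or proxy layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def with_injected_latency(call: Callable[[], str], delay_s: float) -> Callable[[], str]:
    """Wrap a service call so each invocation sleeps first (the injected fault)."""
    def slowed() -> str:
        time.sleep(delay_s)
        return call()
    return slowed


def call_with_timeout(call: Callable[[], str], timeout_s: float, fallback: str) -> str:
    """Caller-side protection: give up after timeout_s and serve a fallback."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the slow call


def fetch_recommendations() -> str:
    """Hypothetical downstream call, assumed for illustration."""
    return "personalized"
```

If the injected delay exceeds the timeout, the caller should serve the fallback instead of hanging; if it never does, the timeout is either missing or too loose.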

05

Deep dive — running chaos in production safely

"Inject failures in production" sounds reckless. Done right, it's the safest possible choice. Rules:

  • Start small. Begin with one instance in a non-critical service. Tiny blast radius.
  • Define abort conditions. "If error rate exceeds 1%, stop the experiment." Automated, not manual.
  • Run during business hours. Engineers awake, ready to react. Not 3am.
  • Communicate. Slack the on-call team before each experiment. No surprises.
  • Build a game-day cadence. Once a quarter, planned multi-team exercises that simulate large-scale failures (region down, dependency dead). Whole org practices the response.
  • Increase blast radius gradually. Single instance → single rack → single AZ → region. Each promotion only after the previous tier is reliable.
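
Two of the rules above, automated abort conditions and gradual blast-radius promotion, lend themselves to a short sketch. The names `TIERS`, `next_tier`, and `run_with_abort` are hypothetical, and the error-rate samples are assumed to come from your own monitoring; the 1% threshold is the example from the abort-condition rule.

```python
from typing import Callable, Iterable

TIERS = ["instance", "rack", "az", "region"]  # smallest blast radius first


def next_tier(current: str, last_run_passed: bool) -> str:
    """Promote one tier only after the current tier passed; otherwise stay put."""
    i = TIERS.index(current)
    if last_run_passed and i < len(TIERS) - 1:
        return TIERS[i + 1]
    return current


def run_with_abort(
    error_rates: Iterable[float],
    rollback: Callable[[], None],
    abort_threshold: float = 0.01,
) -> bool:
    """Watch error-rate samples while a fault is active.

    Returns True if the experiment ran to completion, False if it was aborted.
    The fault is rolled back in both cases, with no human in the loop.
    """
    try:
        for rate in error_rates:
            if rate > abort_threshold:
                return False  # abort condition breached: stop immediately
        return True
    finally:
        rollback()
```

A failed or aborted run keeps the experiment at its current tier, so the blast radius only grows after the smaller scope has been shown to hold.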

Cost: experiments that uncover bugs are cheap relative to the outages they prevent. Netflix has publicly credited this practice with helping it withstand real infrastructure failures. The case for funding a chaos team rests on the incidents it prevents.

Interview answer

"We run continuous chaos: random instance termination weekly, AZ-failover drills monthly, region failover quarterly. Each finds 1-2 production bugs we wouldn't have caught otherwise. Our uptime is high because the alternative — discovering these failures during a real outage — is unacceptable."

06

Real-world

Netflix Chaos Monkey

Where it started

Open-sourced 2012. Killed instances during business hours. Forced every Netflix service to handle node failure as a normal event.

Gremlin / LitmusChaos

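Chaos platforms (managed and open source)

Gremlin is a commercial SaaS platform; LitmusChaos is an open-source, Kubernetes-native CNCF project. Both inject faults such as latency, packet loss, CPU pressure, and disk pressure. Used by teams that don't want to build their own tooling.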

AWS Fault Injection Service (formerly Fault Injection Simulator)

Cloud-native chaos

Native AWS service for orchestrating failure scenarios. Stop EC2, throttle network, fail over RDS. Audit-friendly.

Google DiRT

Disaster Recovery Testing

Annual multi-day exercises. Take down entire datacenters in a controlled way. Catches systemic dependencies.

07

Used in problems

News feed runs Chaos Monkey on the recommendation tier. YouTube/Netflix exercise CDN failovers via planned chaos. Payment gateway tests provider failover with simulated outages. E-commerce game-days the checkout flow before peak shopping seasons.
