Why this matters
You think your system handles a database failover. The runbook says it does. The diagram says it does. But you've never actually tested it. The first time it happens for real — at 3am, with no warning — you discover the standby has stale data, the connection pool doesn't reconnect, and the load balancer takes 90s to notice. Every distributed system carries this kind of silent fragility.
Chaos engineering is the practice of deliberately injecting failures in production (or production-like environments) to find these gaps before they find you. Pioneered by Netflix as Chaos Monkey; now mainstream at every serious production org.