Concept · Reliability

Circuit Breaker

01

Why this matters

Service B is slow. Service A calls it on every request, and each call hangs for 30 seconds before timing out. A's threads fill up waiting for B. A's latency spikes. A's load balancer marks A unhealthy. Now A is down because B is slow: a cascading failure.

A circuit breaker (a pattern, not a box) detects that a downstream dependency is failing and stops calling it for a while, failing fast instead. A's threads free up. A stays responsive; it just returns "service unavailable" for the features that needed B. When B recovers, the breaker lets traffic through again.

02

The three states

  • Closed (normal). All calls go through. Breaker counts failures.
  • Open (tripped). Failure rate exceeded threshold → breaker opens. All calls fail immediately; no request is sent downstream.
  • Half-open (testing). After a cooldown, let one (or a few) test calls through. If they succeed, close the breaker. If they fail, back to open.
Circuit Breaker States
stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failure rate > threshold
    Open --> HalfOpen: cooldown elapsed
    HalfOpen --> Closed: test call succeeded
    HalfOpen --> Open: test call failed
03

Tuning the breaker

Four parameters:

  • Failure threshold — "50% of the last 20 calls failed" OR "10 consecutive failures." A percentage adapts to traffic volume, where a fixed count does not, but it is only meaningful once enough calls have been sampled (see minimum call count).
  • Minimum call count — don't trip on 1 failure out of 1 call. Require ≥ 20 calls before judging.
  • Cooldown — how long to stay open. Typical: 30s for fast-recovering services, minutes for external APIs.
  • What counts as failure — timeouts? 5xx? 4xx? Usually: timeouts + 5xx. 4xx is the client's fault, not the service's.
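As a rough illustration of the last bullet, the rule "timeouts and 5xx count, 4xx does not" might be written as a small predicate. The function name `counts_as_failure` and its parameters are hypothetical and not taken from any particular library.

```python
from typing import Optional


def counts_as_failure(status: Optional[int], timed_out: bool = False) -> bool:
    """Decide whether one call outcome should count against the breaker.

    Hypothetical helper: `status` is the HTTP status code, or None when no
    response arrived at all (connection error or timeout).
    """
    if timed_out or status is None:
        return True   # no usable response: the dependency is struggling
    if 500 <= status <= 599:
        return True   # server-side error
    return False      # 2xx/3xx succeeded; 4xx is the caller's mistake


print(counts_as_failure(503))  # True
print(counts_as_failure(404))  # False
```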
Circuit breaker state machine
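import time
from enum import Enum


class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Single-threaded sketch: no locking, and it trips on consecutive failures.
    def __init__(self, fail_threshold=5, reset_timeout=30):
        self.fail_threshold = fail_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout    # cooldown in seconds
        self.failures = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def call(self, fn, *args, **kw):
        if self.state == State.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = State.HALF_OPEN  # cooldown elapsed: allow a probe
            else:
                raise CircuitOpenError("circuit open")  # fail fast
        try:
            result = fn(*args, **kw)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success, including a half-open probe, closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == State.HALF_OPEN or self.failures >= self.fail_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()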
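
The state machine above trips on a count of consecutive failures. A percentage-based threshold with a minimum call count, as described in the tuning list, could be sketched roughly as follows. The class name `FailureRateWindow` and its parameters are assumptions made for illustration; real libraries differ in how they size and expire the window.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the outcomes of the most recent `size` calls (hypothetical helper)."""

    def __init__(self, size=20, min_calls=20, threshold=0.5):
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.min_calls = min_calls          # don't judge on fewer calls than this
        self.threshold = threshold          # failure rate that trips the breaker

    def record(self, failed):
        self.outcomes.append(bool(failed))  # oldest outcome drops off when full

    def should_trip(self):
        if len(self.outcomes) < self.min_calls:
            return False  # too few samples to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate >= self.threshold


window = FailureRateWindow(size=20, min_calls=20, threshold=0.5)
window.record(True)
print(window.should_trip())  # False: 1 failure out of 1 call is below the minimum
```

A breaker would call `record` after every call and move to open when `should_trip` returns true.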
04

Per-service vs per-instance vs per-endpoint

| Granularity | Failure scope | When to use |
| --- | --- | --- |
| Per-service | Failures to any instance of service B count together | Simplest. Misses per-instance issues. |
| Per-instance | Separate breakers per upstream host | One bad instance doesn't trip the whole service. Best for microservices. |
| Per-endpoint | Separate breaker per (service, path) combo | One slow endpoint in service B doesn't trip calls to other endpoints. Finest grain, most state. |
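
One way to picture the three granularities is as different keys into a map of breakers. The sketch below is a minimal illustration under that assumption; `breaker_key`, `BreakerRegistry`, and the `factory` argument are hypothetical names, and `object` stands in for a real breaker class.

```python
def breaker_key(service, host, path, granularity):
    """Return the key under which a call's breaker state is stored."""
    if granularity == "service":
        return (service,)        # all instances and endpoints share one breaker
    if granularity == "instance":
        return (service, host)   # one breaker per upstream host
    if granularity == "endpoint":
        return (service, path)   # one breaker per (service, path) combo
    raise ValueError(f"unknown granularity: {granularity!r}")


class BreakerRegistry:
    """Lazily creates one breaker per key; `factory` builds a new breaker."""

    def __init__(self, factory, granularity="service"):
        self.factory = factory
        self.granularity = granularity
        self.breakers = {}

    def get(self, service, host, path):
        key = breaker_key(service, host, path, self.granularity)
        if key not in self.breakers:
            self.breakers[key] = self.factory()
        return self.breakers[key]


registry = BreakerRegistry(factory=object, granularity="instance")
a = registry.get("B", "10.0.0.1", "/search")
b = registry.get("B", "10.0.0.2", "/search")
print(a is b)  # False: each host gets its own breaker
```

The finer the key, the more entries the map holds, which is the "most state" cost noted for per-endpoint breakers.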
05

Deep dive — fallback strategies

When the breaker is open, what do you return? Options, from best to worst:

  1. Cached response. "Here's your feed from 2 minutes ago." User barely notices. Works for reads.
  2. Degraded response. Home page without the "related products" widget. Core content still served.
  3. Queue for later. Writes get queued to a durable store (Kafka, SQS) and processed when the service recovers.
  4. Sensible default. "Recommendations service down → show trending instead."
  5. Error to user. "Can't complete this right now, try again later." Last resort — but still better than hanging.

The breaker's value isn't in detecting failure (everything does that). It's in giving you a place to define graceful degradation. Without a breaker, failure cascades. With one, failure stays contained at the call site.
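
A minimal sketch of that "place to define graceful degradation", assuming the breaker signals an open circuit with a dedicated exception. `CircuitOpenError`, `with_fallbacks`, and the in-memory `cache` are hypothetical names used only for this example.

```python
class CircuitOpenError(Exception):
    """Raised by a breaker when it refuses to make the call."""


def with_fallbacks(primary, fallbacks):
    """Try `primary`; if the breaker is open, try each fallback in order."""
    try:
        return primary()
    except CircuitOpenError:
        for fallback in fallbacks:
            result = fallback()
            if result is not None:  # None means "this fallback has nothing"
                return result
        raise  # nothing worked: surface the error to the user


cache = {"feed:42": ["post-1", "post-2"]}


def fetch_feed():
    raise CircuitOpenError("circuit open")  # simulate an open breaker


feed = with_fallbacks(
    fetch_feed,
    [lambda: cache.get("feed:42"),  # cached response
     lambda: ["trending-1"]],       # sensible default
)
print(feed)  # ['post-1', 'post-2']
```

The order of the list encodes the preference order from the options above, with the error to the user as the last resort when every fallback comes back empty.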

Design together

Circuit breaker + retry with backoff + timeouts form the "reliability stack." Never one without the others. Retries without a breaker amplify overload; breakers without retries fail on transient blips.
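
How the three pieces might fit together, as a rough sketch rather than a production implementation: every attempt carries a timeout, retries back off exponentially with jitter, and the breaker is consulted before each attempt so retries stop once it opens. `SimpleBreaker`, `call_with_stack`, and the convention that `fn` accepts a `timeout` keyword are all assumptions made for this example.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to make the call."""


class SimpleBreaker:
    """Minimal consecutive-failure breaker, just enough for the example."""

    def __init__(self, fail_threshold=3, reset_timeout=30.0):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe through (half-open).
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.opened_at = time.monotonic()


def call_with_stack(fn, breaker, timeout=1.0, attempts=3,
                    base_delay=0.1, sleep=time.sleep):
    for attempt in range(attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open")  # fail fast, stop retrying
        try:
            result = fn(timeout=timeout)  # fn is expected to enforce the timeout
        except Exception:
            breaker.record(ok=False)
            if attempt == attempts - 1:
                raise  # out of retries: propagate the last error
            # exponential backoff with full jitter
            sleep(random.uniform(0, base_delay * 2 ** attempt))
        else:
            breaker.record(ok=True)
            return result
```

With `fail_threshold` lower than `attempts`, a persistently failing dependency opens the breaker partway through the retry loop and the remaining attempts are skipped, which is the "retries without a breaker amplify overload" case being avoided.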

06

Real-world

Netflix Hystrix

Popularized the pattern

Java library from Netflix (2012). Circuit breaker + fallback + bulkheads + metrics. Now in maintenance; successors include resilience4j.

Envoy / Istio

Sidecar-level

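Circuit breaking is configured in the service mesh, with no app code. Envoy applies per-cluster limits on connections, pending requests, and retries, and outlier detection ejects failing hosts; Istio exposes both through a DestinationRule.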

resilience4j

Modern Java library

Functional API for circuit breakers, retries, rate limiters, bulkheads. Composes into decorators around service calls.

Polly

.NET equivalent

Same patterns. Fluent API: Policy.Handle<Exception>().CircuitBreaker(...).

07

Used in problems

News feed uses circuit breakers around downstream ranker and enrichment services. E-commerce uses them for payment provider calls (falling back to a queue). Notification system wraps SMS/email provider calls in breakers to avoid cascading failure.
