Concept · Reliability

Retries, Backoff & Jitter

01

Why this matters

Network calls fail, and a simple retry clears most transient failures at almost no cost. But naive retries cause retry storms: every client retries the failing service at the same moment, which is exactly what that service can least handle. The fix is boring and essential: exponential backoff plus jitter.

"We retry on failure" isn't an answer. "We retry up to 3 times with exponential backoff + full jitter + a budget" is.

02

Three disasters of naive retry

Disaster 1: immediate retry. The first attempt fails, so the client retries after 0 ms. That retry fails too (same network issue), so it retries again. 10,000 requests/sec becomes 30,000 requests/sec aimed at a service that couldn't handle 10,000. The service gets worse, not better.

Disaster 2: synchronized retry. The service goes down at T=0. All 10,000 clients notice at once and all retry at T+1s, so a spike of 10,000 requests hits the recovering service in the same instant. The same happens at T+2s, T+4s, and so on. The service never gets a quiet moment to recover.

Disaster 3: unbounded retry. Every service in the call chain retries independently. The client makes up to 3 attempts against A, A makes up to 3 against B, and B makes up to 3 against C. Under failure, one original request becomes 3 × 3 × 3 = 27 requests at C. Amplification compounds through the stack.
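The multiplication behind Disaster 3 can be checked with a few lines. This is a minimal sketch; the function name worst_case_requests is made up for illustration, and it assumes every attempt at every hop fails. Note that 27 corresponds to 3 attempts per hop. If each hop made 3 retries on top of its first attempt (4 attempts), the same chain would reach 64.

```python
def worst_case_requests(attempts_per_hop):
    """Requests reaching the deepest service for one original request,
    assuming every attempt at every hop fails and is retried."""
    total = 1
    for attempts in attempts_per_hop:
        total *= attempts
    return total

# Client -> A -> B -> C with 3 attempts at each hop
print(worst_case_requests([3, 3, 3]))  # 27
# Same chain, but only the outermost caller retries
print(worst_case_requests([3, 1, 1]))  # 3
```

The second call hints at one common mitigation: retry at a single layer and let the inner hops fail fast.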

03

Exponential backoff + full jitter

The battle-tested recipe:

  1. Exponential backoff. Wait 100ms, then 200, 400, 800... doubling each retry. Gives the service time to recover.
  2. Full jitter (AWS Architecture Blog, 2015). Replace the exact wait with random(0, current_backoff). This spreads clients' retries across the interval, so there is no synchronized spike.
  3. Max retries. 3–5 is typical. Beyond that, accept failure.
  4. Max backoff cap. 30 seconds or so. Don't wait an hour for retry 7.

Pseudocode:

for attempt in 1..max_attempts:
  try: return call()
  except Transient as err:
    if attempt == max_attempts: throw err   # out of attempts, don't sleep first
    base = min(max_backoff, initial * 2^(attempt - 1))   # 100ms, 200ms, 400ms...
    wait = random(0, base)
    sleep(wait)

Exponential backoff with jitter, in Python:

import random, time

def retry(fn, max_attempts=5, base=0.1, cap=10.0,
          retry_on=(ConnectionError, TimeoutError)):
    """Full-jitter exponential backoff (AWS recommended)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:  # only transient errors; anything else propagates
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # base × 2^attempt, capped; uniform jitter 0..backoff
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))

# Delays between the 5 attempts: 0-0.1s, 0-0.2s, 0-0.4s, 0-0.8s
# Full-jitter outperforms "backoff + small jitter" for contended resources
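To make the effect of jitter concrete, here is a small simulation sketch. It counts how many retries land in the busiest 10 ms window when 10,000 clients all wait for the same backoff step, first with a fixed delay and then with full jitter. The helper name peak_arrivals, the bucket size and the client count are illustrative assumptions, not measurements from a real system.

```python
import random
from collections import Counter

def peak_arrivals(n_clients, delay_fn, bucket=0.01, seed=0):
    """Largest number of retries landing in any one `bucket`-second window."""
    rng = random.Random(seed)
    buckets = Counter(int(delay_fn(rng) / bucket) for _ in range(n_clients))
    return max(buckets.values())

backoff = 0.4  # third retry with base=0.1: 0.1 * 2**2

fixed = peak_arrivals(10_000, lambda rng: backoff)
jittered = peak_arrivals(10_000, lambda rng: rng.uniform(0, backoff))

print(fixed)     # 10000: every client lands in the same window
print(jittered)  # far smaller: the same retries are spread over ~40 windows
```

The total number of retries is identical in both runs. Jitter only changes when they arrive, which is the part the recovering service cares about.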
04

What to retry, what NOT to retry

| Error | Retry? | Why |
| --- | --- | --- |
| Network timeout | Yes | Transient. Next attempt may succeed. |
| 5xx (server error) | Usually | Server's fault, not yours. 500, 502, 503 and 504 may recover; 501 Not Implemented will not. |
| 503 Service Unavailable | Yes, with backoff | Explicit "try later" signal. |
| 429 Too Many Requests | Yes, honoring Retry-After | Rate limited. Server told you when to retry. |
| 4xx other | No | Client's fault. Retrying won't help. |
| Non-idempotent POST | Only with idempotency key | Retry might double-execute. Dangerous without guardrails. |
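The table can be turned into a small decision function. This is a sketch under stated assumptions: the name should_retry, its parameters, and the exact set of retryable status codes are illustrative choices, and the rule is deliberately conservative (it refuses to repeat any non-idempotent request without an idempotency key, even on a 429).

```python
# Assumed set of status codes worth retrying; tune for your own services.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}
# HTTP methods that are idempotent by definition.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def should_retry(method, status=None, timed_out=False, has_idempotency_key=False):
    """Return True if this failure looks transient AND repeating it is safe."""
    transient = timed_out or status in RETRYABLE_STATUS
    if not transient:
        return False
    return method.upper() in IDEMPOTENT_METHODS or has_idempotency_key

print(should_retry("GET", status=503))                               # True
print(should_retry("GET", status=404))                               # False
print(should_retry("POST", timed_out=True))                          # False
print(should_retry("POST", timed_out=True, has_idempotency_key=True))  # True
```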
05

Deep dive — retry budgets

Even exponential-backoff retries amplify load during outages. If 100% of requests are failing, 3 retries per request mean the failing service receives up to 4× its normal traffic (the original plus 3 retries) precisely when it can least handle it.

A retry budget caps the retry multiplier globally. Rule: "retries may be at most 20% of normal RPS." If all requests are failing, retries are capped, so you fail fast for most requests rather than amplifying.

Google's SRE book describes this technique, and the AWS SDKs implement a form of it. Envoy exposes it through its retry_budget config. The net effect: under sustained failure, the service sees at most about 1.2× normal RPS instead of 4× or 10×, which leaves it room to recover.
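A retry budget can be sketched in a few lines. The class below is a hypothetical illustration, not the API of Envoy or any SDK: RetryBudget, percent and min_retries are made-up names, and the counters never reset, whereas a production version would track a sliding time window or use a token bucket.

```python
class RetryBudget:
    """Grant retries only while they stay under `percent` of requests seen."""

    def __init__(self, percent=20, min_retries=1):
        self.percent = percent
        self.min_retries = min_retries  # always allow a few retries at low volume
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire(self):
        """Return True and spend budget if a retry is allowed right now."""
        allowed = max(self.min_retries, self.requests * self.percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False

# Total outage: 1,000 requests arrive and every one of them fails.
budget = RetryBudget(percent=20)
granted = 0
for _ in range(1000):
    budget.record_request()
    if budget.try_acquire():
        granted += 1

print(granted, 1000 + granted)  # 200 1200
```

With the budget in place, the failing service sees 1,200 requests instead of the 4,000 that three unconditional retries per request would produce.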

Production answer

"Exponential backoff with full jitter, max 3 retries, 30s max backoff, retry budget capped at 20% of baseline traffic. Combine with a circuit breaker so sustained failure stops retries entirely."

06

Real-world

AWS SDKs

Full jitter baked in

The SDKs retry with exponential backoff plus jitter by default, configurable per service. The reference implementation.

Envoy / gRPC

Retry + budget

Declarative retry policies in config. Retry budget caps total amplification. No app code.

Stripe

Client SDK retry + idempotency keys

Retries with exponential backoff; idempotency keys ensure POSTs are safe to retry. Industry benchmark.
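Why an idempotency key makes a POST safe to retry is easiest to see from the server's side. The toy class below is a hypothetical in-memory sketch, not Stripe's API: PaymentServer, charge and the response fields are invented for illustration, and a real service would persist keys and expire them.

```python
import uuid

class PaymentServer:
    """Toy server that remembers the response for each idempotency key."""

    def __init__(self):
        self._responses = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._responses:
            # Replay the stored response; no second charge happens
            return self._responses[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response

server = PaymentServer()
key = str(uuid.uuid4())            # generated once per logical operation
first = server.charge(key, 1999)
retry = server.charge(key, 1999)   # e.g. the first response was lost in transit

print(first == retry, server.charges_made)  # True 1
```

The client must reuse the same key for every retry of one operation and generate a fresh key for each new operation.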

HTTP Retry-After header

Server-directed retry

When rate-limited (429) or overloaded (503), the server can send Retry-After: 10, a delay in seconds, or an HTTP date. Good clients honor it, because it is the server's own backpressure signal.
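Retry-After can carry either a number of seconds or an HTTP date, so a client needs to handle both. The helper below is a minimal sketch using only the standard library; the name retry_after_seconds and the choice to return None for unusable values are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, now=None):
    """Parse a Retry-After header value into a non-negative delay in seconds.

    Returns None if the value is missing or cannot be parsed."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():                # delay-seconds form, e.g. "10"
        return float(value)
    try:                               # HTTP-date form
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):    # raised for unparseable input
        return None
    if when.tzinfo is None:            # treat a naive result as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

print(retry_after_seconds("10"))       # 10.0
print(retry_after_seconds("soon"))     # None
```

When the helper returns a delay, a client would typically wait at least that long; when it returns None, it can fall back to its own jittered backoff.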

07

Used in problems

Notification system retries failed deliveries with exponential backoff. Payment gateway retries idempotent charge confirmations. Web crawler backs off when hitting rate-limited domains.
