Retries, Backoff & Jitter

01

Why this matters

Network calls fail. Retrying resolves most transient failures at almost no cost. But naive retries cause retry storms: every client retries the failing service at the same moment, which is exactly what that service can least handle. The fix is boring and essential: exponential backoff plus jitter.

"We retry on failure" isn't an answer. "We retry up to 3 times with exponential backoff + full jitter + a budget" is.

02

Three disasters of naive retry

Disaster 1: immediate retry. The first attempt fails. The client retries after 0ms, and that retry fails too (same network issue), so it retries again. With two immediate retries per request, 10,000 requests/sec becomes 30,000 requests/sec aimed at a service that couldn't handle 10k. The service gets worse, not better.

Disaster 2: synchronized retry. Service outage at T=0. All 10,000 clients notice simultaneously and all retry at T+1s, so a spike of 10k requests hits the recovering service at the same instant. The same spike repeats at T+2s, T+4s, and so on. The service never gets a quiet moment to recover.

Disaster 3: unbounded retry. Every service in the call chain retries independently. The client makes 3 attempts at A, A makes 3 attempts at B for each of those, and B makes 3 attempts at C. One original request becomes 3 × 3 × 3 = 27 requests at C under failure. Amplification cascades through the stack.
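The 27-request figure is just attempts per layer raised to the number of layers. A quick sketch of that arithmetic (the function name `worst_case_calls` is made up for illustration):

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest service when every attempt at every layer fails."""
    return attempts_per_layer ** layers

# Client -> A -> B -> C is three retrying hops, 3 attempts each.
print(worst_case_calls(3, 3))  # 27

# If "3 retries" means 3 extra tries after the first (4 attempts per hop),
# the same chain is noticeably worse.
print(worst_case_calls(4, 3))  # 64
```

This is why retries are usually enabled at one layer of the stack rather than at every hop.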

03

Exponential backoff + full jitter

The battle-tested recipe:

Exponential backoff. Wait 100ms, then 200, 400, 800... doubling each retry. Gives the service time to recover.
Full jitter (AWS Architecture Blog, 2015). Replace the exact wait with random(0, current_backoff). This spreads clients' retries across the whole interval, so there is no synchronized spike.
Max retries. 3–5 is typical. Beyond that, accept failure.
Max backoff cap. 30 seconds or so. Don't wait an hour for retry 7.

Pseudocode:

for attempt in 0..max_retries:             # attempt 0 is the original call
  try: return call()
  except Transient as err:
    if attempt == max_retries: throw err   # out of retries, don't sleep again
    base = min(max_backoff, initial * 2^attempt)
    wait = random(0, base)
    sleep(wait)

Exponential backoff with jitter

import random, time

def retry(fn, max_attempts=5, base=0.1, cap=10.0):
    """Full-jitter exponential backoff (AWS recommended)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # in real code, catch only transient error types
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # base × 2^attempt, capped; uniform jitter 0..backoff
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))

# Delays before retries 1-4 (defaults): 0-0.1s, 0-0.2s, 0-0.4s, 0-0.8s
# The 5th attempt is the last one, so no sleep follows it.
# Full jitter tends to outperform "backoff + small jitter" for contended resources
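
To see why jitter matters, here is a small simulation sketch. It assumes 10,000 clients that all failed at the same instant and counts how many of their retries land in the same 10ms window. The helper `peak_load` and its parameters are hypothetical, chosen only for this illustration.

```python
import random
from collections import Counter

def peak_load(num_clients, attempt, base=0.1, cap=10.0,
              jitter=True, bucket=0.01, seed=0):
    """Largest number of retries landing in any single `bucket`-second window."""
    rng = random.Random(seed)
    backoff = min(cap, base * (2 ** attempt))
    hits = Counter()
    for _ in range(num_clients):
        # Without jitter every client waits exactly `backoff` seconds.
        delay = rng.uniform(0, backoff) if jitter else backoff
        hits[int(delay / bucket)] += 1
    return max(hits.values())

synchronized = peak_load(10_000, attempt=3, jitter=False)
spread = peak_load(10_000, attempt=3, jitter=True)
print(synchronized)  # 10000: every client fires in the same window
print(spread)        # far smaller; the exact value depends on the seed
```

With a 0.8s backoff split into 10ms windows, the jittered retries are spread over about 80 windows instead of piling into one.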

04

What to retry, what NOT to retry

Error	Retry?	Why
Network timeout	Yes	Transient. Next attempt may succeed.
5xx (server error)	Yes	Server's fault, not yours. May recover.
503 Service Unavailable	Yes, with backoff	Explicit "try later" signal.
429 Too Many Requests	Yes, honoring Retry-After	Rate limited. Server told you when to retry.
4xx other	No	Client's fault. Retrying won't help.
Non-idempotent POST	Only with idempotency key	Retry might double-execute. Dangerous without guardrails.
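
The table above can be turned into a small decision function. This is a sketch, not a complete policy: `should_retry` and its parameters are illustrative names, and it deliberately takes the conservative view that a non-idempotent request is never retried without an idempotency key.

```python
# Methods that HTTP defines as idempotent; POST and PATCH are not.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def should_retry(method, status=None, timed_out=False, has_idempotency_key=False):
    """Return True if the failure is worth retrying according to the table."""
    # Non-idempotent requests are only safe to retry with an idempotency key.
    if method.upper() not in IDEMPOTENT_METHODS and not has_idempotency_key:
        return False
    if timed_out:
        return True          # network timeout: transient
    if status is None:
        return False         # no timeout and no response: nothing to go on
    if status == 429:
        return True          # rate limited: retry, honoring Retry-After
    if 500 <= status <= 599:
        return True          # server-side failure: may recover
    return False             # other 4xx (and anything else): retrying won't help

print(should_retry("GET", status=503))                             # True
print(should_retry("GET", status=404))                             # False
print(should_retry("POST", status=500))                            # False
print(should_retry("POST", status=500, has_idempotency_key=True))  # True
```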

05

Deep dive — retry budgets

Even exponential-backoff retries amplify load during outages. If 100% of requests are failing and each client retries 3 times, the failing service gets up to 4× its normal traffic (the original attempt plus 3 retries) precisely when it can't handle it.

A retry budget caps the retry multiplier globally. Rule: "retries may be at most 20% of normal RPS." If all requests are failing, retries are capped — you fail fast for most requests rather than amplifying.

Google's SRE book describes this pattern, and AWS SDKs implement a variant of it as a retry quota. Envoy exposes it through its retry_budget config. The net: under sustained failure with a 20% budget, the service sees at most about 1.2× normal RPS instead of several times that, which gives it room to recover.
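
A minimal sketch of a retry budget, assuming a simple credit scheme: every first attempt earns a fraction of a retry, and each retry spends one whole credit. The class `RetryBudget` and its methods are hypothetical names for illustration; this is not how Envoy or the AWS SDKs implement their budgets, only the same idea in a few lines. Credits are stored as integers (hundredths of a retry) to avoid floating-point drift.

```python
class RetryBudget:
    """Allow retries only while they stay under `percent` of first-attempt traffic."""

    def __init__(self, percent=20, max_banked_retries=10):
        self.percent = percent                 # retries allowed per 100 requests
        self.cap = max_banked_retries * 100    # limit on saved-up credit
        self.credits = 0                       # hundredths of a retry

    def record_request(self):
        # Each first attempt earns percent/100 of a retry, up to the cap.
        self.credits = min(self.cap, self.credits + self.percent)

    def try_spend(self):
        # A retry costs one whole credit; refuse when the budget is empty.
        if self.credits >= 100:
            self.credits -= 100
            return True
        return False

# Total outage: 1000 requests arrive and every one of them fails.
budget = RetryBudget(percent=20)
allowed_retries = 0
for _ in range(1000):
    budget.record_request()
    if budget.try_spend():
        allowed_retries += 1

print(allowed_retries)           # 200
print(1000 + allowed_retries)    # 1200 total calls, i.e. 1.2x normal traffic
```

Requests that cannot get a retry credit fail fast, which is the behavior the 20% rule is meant to enforce.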

Production answer

"Exponential backoff with full jitter, max 3 retries, 30s max backoff, retry budget capped at 20% of baseline traffic. Combine with a circuit breaker so sustained failure stops retries entirely."

06

Real-world

AWS SDKs

Full jitter baked in

AWS SDKs retry with exponential backoff and jitter by default. Configurable per service. The reference implementation.

Envoy / gRPC

Retry + budget

Declarative retry policies in config. Retry budget caps total amplification. No app code.

Stripe

Client SDK retry + idempotency keys

Retries with exponential backoff; idempotency keys ensure POSTs are safe to retry. Industry benchmark.
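
A sketch of how an idempotency key makes a POST safe to retry: the key is generated once and reused on every attempt, so the server can recognize duplicates. The `send` callable and its `idempotency_key` parameter are assumptions for illustration, not any particular SDK's API.

```python
import random
import time
import uuid

def post_with_retries(send, payload, max_attempts=3, base=0.1, cap=10.0,
                      sleep=time.sleep):
    """Retry a non-idempotent call, reusing one idempotency key for all attempts."""
    key = str(uuid.uuid4())  # generated once, never per attempt
    for attempt in range(max_attempts):
        try:
            return send(payload, idempotency_key=key)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full-jitter exponential backoff between attempts.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Generating a fresh key inside the loop would defeat the purpose: the server would treat each retry as a new request and could execute it twice.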

HTTP Retry-After header

Server-directed retry

When a client is rate-limited (429) or the server is overloaded (503), the server can send Retry-After: 10. Good clients honor it, because it is the server's explicit backpressure signal.
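
Retry-After can carry either a number of seconds or an HTTP date, so a client needs to handle both. A minimal parsing sketch using only the standard library; the function name `retry_after_seconds` and the choice to return None for unparseable values are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(value, now=None):
    """Parse a Retry-After header value into a non-negative delay in seconds.

    Returns None when the value is neither delay-seconds nor an HTTP date.
    """
    value = value.strip()
    if value.isdigit():
        return float(value)              # e.g. "Retry-After: 10"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):      # exception type varies by Python version
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative sleep.
    return max(0.0, (when - now).total_seconds())

print(retry_after_seconds("10"))  # 10.0
fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))  # 60.0
```

A client would typically sleep for the larger of this value and its own jittered backoff.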

07

Used in problems

Notification system retries failed deliveries with exponential backoff. Payment gateway retries idempotent charge confirmations. Web crawler backs off when hitting rate-limited domains.

References & Videos

Retry Strategies & Circuit Breaker

ByteByteGo · 8 min

Retry with Exponential Backoff

Gaurav Sen · 15 min

Exponential Backoff and Jitter

AWS Architecture Blog

Exponential Backoff Algorithm

GeeksforGeeks
