Request Hedging

01

Why this matters

Your service makes 5 backend calls per user request. Each call has a P99 of 50ms — sounds great. But the combined P99 is much worse: even one slow call ruins the request. With 5 calls, the chance of at least one being a P99 outlier is roughly 5%. Tail latency dominates user experience, and reducing it is mathematically harder than reducing average latency.

Request hedging (Jeff Dean, "The Tail at Scale", 2013) is the elegant trick: send the same request to two replicas; take whichever responds first; cancel the other. Tail latency drops dramatically because one slow replica no longer slows you.

02

The math

Replica P99 = 50ms means 1% of requests take ≥ 50ms. Send to 2 independent replicas; both must be slow for you to be slow. P[both slow] = 0.01 × 0.01 = 0.0001. Hedged P99 ≈ replica P99.99 — two orders of magnitude tighter.

For 3 replicas, the chance both/all are slow drops to 10⁻⁶. Diminishing returns kick in fast — most systems use 2-way hedging.

03

How to hedge without 2× cost

Naive hedging doubles backend traffic. Server-side load doubles. Defeats the purpose of having multiple replicas.

Tied requests (Google's variant): send the request to replica A. Wait a short delay (e.g., 95th-percentile latency = 25ms). If no response yet, send to replica B with a "this is hedged" flag. Replicas coordinate: when one starts processing, it notifies the other to abort.

Result: under normal conditions, only one request runs. The hedged second request only fires when the first is unusually slow — covering the slow tail without doubling load.

Hedged Request TimingMermaid

sequenceDiagram participant C as Client participant A as Replica A participant B as Replica B C->>A: request (t=0) Note over C: Wait ≤ 25ms (P95) Note over A: A is slow today (GC pause) C->>B: hedged request (t=25ms) A-->>B: cancel signal B->>B: process B-->>C: response (t=40ms) Note over C: Saved ~10ms — A might have taken 80ms

04

When hedging works (and doesn't)

Works for:

Idempotent reads — re-asking is safe.
Fan-out queries (search, recommendation) — already querying many replicas.
Latency-critical paths — where saving tail latency justifies coordination cost.

Don't hedge:

Mutations (POST/PUT/DELETE) — would execute twice. Use idempotency keys if you must.
Expensive operations — running a 10-second analytics query twice eats 20 seconds of CPU; the hedge isn't free.
When all replicas are correlated — if they're all slow due to a common cause (overloaded shared dependency), hedging multiplies the load on the bottleneck.

05

Deep dive — Google's tail-at-scale techniques

Jeff Dean's 2013 paper "The Tail at Scale" laid out a family of related techniques to fight tail latency:

Hedged requests — described above.
Tied requests — hedging variant where replicas cancel each other; near-zero overhead in the common case.
Probing for free capacity — instead of hedging, route around busy replicas by querying load before sending. Avoids hedge entirely.
Micro-partitioning — make each shard small enough that one node can hold dozens; failure is fine-grained, not catastrophic.
Selective replication — keep extra replicas of "hot" data so reads have many sources to choose from.
Late-binding — defer the choice of which replica to query until the last moment, so you can pick the fastest.

Google search's sub-200ms latency is built on these techniques layered. Without them, querying 1000 shards across hundreds of machines would have a tail latency in the seconds.

Interview cite

"For tail-sensitive reads — autocomplete, search, recommendations — we use tied hedged requests. Wait until P95, then dispatch a backup; replicas coordinate to cancel the loser. P99 latency drops to roughly the replica's P99.99."

06

Real-world

Google search

Hedging at every layer

Search is a planet-scale fan-out across thousands of replicas. Hedging is not optional — without it, tail latency would dominate.

Cassandra speculative retry

Built-in

Configurable per table. After P95, query an additional replica. Same idea, baked into the DB.

gRPC retries with backoff

Spec-level support

gRPC's retry policy allows hedging — multiple parallel attempts with deadline coordination.

Envoy

Hedge policies

Configure per-route: number of hedge requests, delay before triggering, max attempts. No app changes.

07

Used in problems

Typeahead suggestions hedges across replica caches to keep P99 under 100ms. Yelp / Google Places hedges location-shard queries. News feed hedges fan-out reads. Stock trading hedges quote lookups across redundant matchers.

📺

References & Videos

Resilience Patterns

ByteByteGo · 8 min

Hedged Requests — Reducing Tail Latency

Why this matters

The math

How to hedge without 2× cost

When hedging works (and doesn't)

Deep dive — Google's tail-at-scale techniques

Real-world

Hedging at every layer

Built-in

Spec-level support

Hedge policies

Used in problems

References & Videos

Retries, Backoff & Jitter

Latency Numbers

Search Autocomplete