Bulkhead Isolation

01

Why this matters

Your service has 200 threads. It calls services A, B, C. Service C hangs. Requests to C accumulate; all 200 threads end up waiting on C. Requests to A and B can't get served because no thread is free — even though A and B are perfectly healthy.

Bulkheads — borrowed from shipbuilding, where watertight compartments prevent one leak from sinking the whole ship — separate resource pools per dependency. Thread pool for A, thread pool for B, thread pool for C. C going bad can saturate its own pool, but A and B keep serving.

02

The failure it prevents

Without bulkheads, a slow dependency can cause resource exhaustion in the caller — connections, threads, or memory all tied up waiting. The caller can no longer serve any request, even ones that don't touch the slow dependency. This is how a single slow service kills an entire platform.

With bulkheads, each dependency has its own pool. When dependency C saturates, only calls to C start queuing/rejecting. Everything else is untouched.

03

Two main forms

Thread pool isolation

One pool per downstream

A service splits its threads into dedicated pools: 50 for the DB, 30 for service A, 20 for service B, 10 for the payment provider. Each call goes through the pool reserved for its dependency. Payment provider hangs → its 10 threads are stuck → the other 100 keep working.
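A minimal sketch of the idea, assuming Python's standard `concurrent.futures` executors stand in for the per-dependency pools. The pool names, the `call` helper, and the simulated `slow_payment` and `fast_query` functions are hypothetical, chosen only to mirror the example above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool sizes, mirroring the example in the text.
POOL_SIZES = {"db": 50, "service_a": 30, "service_b": 20, "payment": 10}

# One executor per downstream: a hung dependency can only pin its own threads.
pools = {
    name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
    for name, size in POOL_SIZES.items()
}

def call(dependency, fn, *args):
    """Submit fn to the pool reserved for `dependency` and return a Future."""
    return pools[dependency].submit(fn, *args)

# Simulate the payment provider hanging: every payment call blocks on an event.
payment_hang = threading.Event()

def slow_payment():
    payment_hang.wait()  # blocks until the event is set
    return "paid"

def fast_query():
    return "row"

# Saturate the payment pool: 10 calls occupy its threads, 5 more queue behind them.
stuck = [call("payment", slow_payment) for _ in range(15)]

# A DB call still completes, because it runs on a different pool's threads.
db_result = call("db", fast_query).result(timeout=2)

# Release the hung calls so the sketch can shut down cleanly.
payment_hang.set()
payment_results = [f.result(timeout=2) for f in stuck]
for pool in pools.values():
    pool.shutdown()
```

Note that `ThreadPoolExecutor` queues excess work without bound instead of rejecting it, so a production bulkhead would likely also cap the queue or apply timeouts.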

Semaphore isolation

Counting semaphore per downstream

Shared thread pool, but each dependency is limited to N concurrent calls by a counting semaphore. Lighter weight than separate thread pools: no extra threads and no hand-off between threads. The call still runs on the caller's thread, so a truly blocking call pins that thread and cannot be abandoned on timeout. Best suited to fast, non-blocking calls.
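The semaphore form can be sketched in a few lines, assuming a reject-when-full policy (queueing with a timeout is an equally valid choice). The class and exception names are hypothetical.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when a dependency is already at its concurrency limit."""

class SemaphoreBulkhead:
    def __init__(self, max_concurrent):
        # One permit per allowed in-flight call to this dependency.
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("concurrency limit reached")
        try:
            # Runs on the caller's thread; there is no hand-off to another pool.
            return fn(*args, **kwargs)
        finally:
            self._permits.release()

# Usage: at most 20 concurrent calls to a hypothetical service B.
service_b = SemaphoreBulkhead(max_concurrent=20)
result = service_b.run(lambda: "ok")
```

Because `fn` executes on the calling thread, nothing here can interrupt it if it hangs. That is the limitation described above.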

04

Bulkheads at every level

Service level. Separate connection pools per downstream.
Container level. Run two copies of a service in separate pods for tenant isolation. "Tenant A's bug doesn't affect tenant B."
Process level. Noisy-neighbor isolation on a host by running each tenant in its own container with CPU/memory limits.
Database level. Separate database instances or read replicas per service or tenant. One team's bad query doesn't lock or starve a database everyone shares.
Regional level. One region failing doesn't affect others. The biggest bulkhead of all.

05

Tradeoffs

Cost: reserved capacity per bulkhead = unused capacity during normal operation. If your DB pool has 50 threads but only 10 are busy, those 40 idle threads are "wasted" compared to a shared 200-thread pool where everyone competes.

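Tuning: each pool must be sized for its dependency's normal and peak load. Too small and you throttle healthy traffic. Too large and a failing dependency can still tie up enough threads to starve the rest, so the bulkhead isolates nothing. A common starting point: size = normal_concurrency × 1.5, then adjust against measured peaks.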
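As a rough illustration of the sizing heuristic, the sketch below estimates normal concurrency with Little's law (in-flight calls = arrival rate × average latency) and then applies the 1.5 headroom factor. Little's law is an assumption added here as one way to obtain normal_concurrency; the function name and the traffic figures are hypothetical.

```python
import math

def bulkhead_size(requests_per_second, avg_latency_seconds, headroom=1.5):
    """Estimate a pool size from normal concurrency plus headroom."""
    # Little's law: calls in flight = arrival rate x time each call takes.
    normal_concurrency = requests_per_second * avg_latency_seconds
    # Round up, and never return a pool with zero capacity.
    return max(1, math.ceil(normal_concurrency * headroom))

# 80 calls/s at 250 ms each means about 20 in flight, so the pool gets 30 slots.
size = bulkhead_size(requests_per_second=80, avg_latency_seconds=0.25)
```

The result is only a starting value. Peak traffic and tail latency usually matter more than averages, so the number should be checked against production measurements.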

Complexity: each pool is a separate thing to monitor, tune, and alert on. The payoff is largest for dependencies whose failure you can survive. For critical-path synchronous dependencies, bulkheads just surface the failure faster, which is still better than cascading.

06

Deep dive — the cell architecture

AWS's internal answer to the "one failure takes down everything" problem is cell-based architecture. Each cell is a complete, self-contained stack serving a shard of users. Cell 1 serves 10% of customers; cell 2 serves the next 10%; and so on. A bug or corruption in cell 3 affects only 10% of traffic.

Cells are the mother of all bulkheads: entire vertical slices isolated from each other. The tradeoff: 10 cells means roughly 10× the infra to manage. AWS does this because at their scale even a "small" outage hits millions of users, and an outage confined to 10% of customers is far better than one that reaches all of them.

For most systems, this is overkill. But the principle — isolate failure domains at the largest practical boundary — applies at every scale. Run separate Kubernetes namespaces per team. Separate Redis clusters per tenant. Separate databases per region. The pattern repeats.
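To make the routing step concrete, here is a minimal sketch of assigning customers to cells with a stable hash. This is an illustrative assumption, not a description of how AWS routes traffic; `CELL_COUNT` and `cell_for` are hypothetical names.

```python
import hashlib

CELL_COUNT = 10  # hypothetical: ten cells, roughly 10% of customers each

def cell_for(customer_id: str, cell_count: int = CELL_COUNT) -> int:
    """Map a customer to a cell index using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and would route inconsistently across hosts.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count

# The same customer always lands in the same cell.
cell = cell_for("customer-42")
```

Plain modulo routing reassigns most customers whenever the cell count changes, so a real deployment would more likely keep an explicit customer-to-cell mapping or use consistent hashing.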

The guiding principle

Fault is local. Fault is bounded. Fault does not spread.

07

Real-world

Netflix Hystrix

Thread pools per dependency

The default thread pool size was 10 threads, with commands for the same dependency sharing a pool. Popularized the pattern in the microservice era.

resilience4j / Polly

Bulkhead operators

Declarative: wrap a call in a bulkhead decorator; specify max concurrency. Failure isolation by composition.
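The decorator style can be imitated in plain Python. The sketch below is hand-rolled and is not the resilience4j or Polly API; `bulkhead`, `fallback`, `BulkheadRejected`, and `quote_shipping` are hypothetical names. It shows the composition idea: one decorator caps concurrency, another decides what to do on rejection.

```python
import functools
import threading

class BulkheadRejected(Exception):
    """Raised when the decorated call is at its concurrency limit."""

def bulkhead(max_concurrent):
    """Decorator factory: cap concurrent executions of the wrapped function."""
    permits = threading.BoundedSemaphore(max_concurrent)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Reject at once when every permit is taken.
            if not permits.acquire(blocking=False):
                raise BulkheadRejected(fn.__name__)
            try:
                return fn(*args, **kwargs)
            finally:
                permits.release()
        return wrapper
    return decorator

def fallback(default):
    """Decorator factory: return `default` when the bulkhead rejects a call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except BulkheadRejected:
                return default
        return wrapper
    return decorator

# Composition: the bulkhead limits concurrency, the fallback handles rejection.
@fallback(default="shipping quote unavailable")
@bulkhead(max_concurrent=5)
def quote_shipping(order_id):
    return f"quote for {order_id}"
```

Decorator order matters: `fallback` must wrap `bulkhead`, otherwise the rejection is raised outside the handler and never caught.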

Kubernetes resource quotas

Namespace isolation

Per-namespace CPU/memory limits. One team's runaway pod can't consume cluster resources meant for others.

AWS cells

DynamoDB, S3 internal architecture

Dynamic sharding into cells with blast-radius limits. Not user-visible, but one reason these services rarely suffer a complete outage.

08

Used in problems

E-commerce systems put bulkheads around the payment provider, tax service, and shipping API. Payment gateways isolate a thread pool per provider. Notification systems use a bulkhead per delivery channel (SMS, email, push).
