A single server can serve maybe a few thousand requests per second before CPU, memory, or connection limits pin it. The moment your traffic crosses that ceiling, you have two options: buy a much bigger server (vertical scaling — expensive, has a hard ceiling, single point of failure) or put many cheap servers behind something that spreads traffic across them. That "something" is a load balancer.
Without a load balancer, horizontal scaling is impossible. Clients would need to know every backend server by IP, pick one, and retry when it's down. The load balancer hides the fleet behind one stable address and deals with failure, capacity, and routing on the client's behalf.
What it actually does
Terminates client connections, picks a healthy backend, forwards the request, and returns the response. Bonus: it pulls dead servers out of rotation within seconds and handles TLS termination so backends can speak plain HTTP internally.
02
Intuition
Picture a grocery store on Sunday afternoon. Twelve checkout lanes are open. You don't walk in and pick lane 7 because you memorised it — you look for the shortest queue, or a greeter directs you to the next free cashier. If a cash register breaks, the greeter stops sending you there. If the store gets busier, they open more lanes.
The greeter is a load balancer. The cashiers are your backend servers. The shoppers are your requests. The whole point: the shoppers don't need to know which cashiers exist — they just enter the store and get served.
Traffic SpreadingSVG
03
How it works
A request arrives at the load balancer's IP. The LB does four things:
Accepts the connection (TCP SYN/ACK) and optionally terminates TLS so it can read HTTP headers.
Picks a backend using a chosen algorithm (see §4) — round robin, least connections, IP hash, etc.
Opens or reuses a connection to that backend and forwards bytes. The client has no idea which backend answered.
Watches health continuously: every healthCheckInterval (often 2–5s), the LB pings each backend on a known path like /health. Three failed checks in a row → the backend is yanked out of rotation. The next time a check succeeds, it's put back.
There's a subtle detail here: health checks decide fleet membership, not the request router. The router only picks from backends the health checker has marked healthy. This decoupling is why a load balancer can remove a crashing server in 5 seconds without the router needing to know anything about crashes.
Request LifecycleMermaid.js
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant HC as Health Checker
participant S1 as Server 1
participant S2 as Server 2
Note over HC,S2: Every 3s — running in background
HC->>S1: GET /health
S1-->>HC: 200 OK
HC->>S2: GET /health
S2-->>HC: timeout
Note over HC: S2 marked unhealthy after 3 fails
C->>LB: POST /api/order
Note over LB: Pick from {S1} only
LB->>S1: Forward request
S1-->>LB: 201 Created
LB-->>C: 201 Created
Note over HC,S2: S2 comes back up
HC->>S2: GET /health
S2-->>HC: 200 OK
Note over HC: S2 re-added to pool
~100μs
LB overhead per request
2–5s
health check interval
3×
failures before removal
30–60s
drain time (in-flight requests)
04
Variants — Layer 4 vs Layer 7, and the algorithm menu
The single biggest decision is what layer you operate at. Everything else flows from it.
Layer 4 · Transport
Forwards TCP/UDP packets
The LB sees the 4-tuple (src IP, src port, dst IP, dst port) and picks a backend once per connection. It doesn't read the HTTP request — it just shovels bytes. Pros: very fast, handles any protocol, fewer moving parts. Cons: can't route by URL path, can't decrypt TLS, can't add a trailing /.
Layer 7 · Application
Understands HTTP (the usual choice)
The LB parses the request line and headers, so it can route /api/* to one service and /images/* to another, do sticky sessions by cookie, add auth headers, terminate TLS, compress responses. Pros: content-aware routing is a superpower. Cons: higher CPU cost; you're on the critical path for every byte.
Once you've picked a layer, you pick an algorithm for distributing load across backends:
Algorithm
How it picks
Best for
Watch out for
Round Robin
Rotates: 1 → 2 → 3 → 1 → …
Stateless services, uniform request cost
Ignores that request #42 was a 10-second report — the slow backend gets the next request too
Least Connections
Sends to the backend with fewest open connections right now
Long-lived connections (WebSockets, DB proxies), variable request cost
"Connection count ≠ work" — a backend idle on 5 long polls looks busier than one grinding CPU on 2
Pick 2 random backends, send to the one with fewer active requests
Most modern systems — near-optimal with almost no state
Nothing obvious. This is what Envoy and AWS ALB default to.
Default to this
For most HTTP APIs: L7 load balancing + Power-of-Two-Choices. It handles variable request cost gracefully, routes by path, and doesn't need the tuning of RR or least-conn.
05
Tradeoffs — when NOT to reach for a load balancer
Load balancers buy you scale and availability, but they cost:
Latency — an extra network hop (~0.1–1ms inside a region, worse across AZs).
Cost — an AWS ALB at 1M requests/hour runs roughly $20/day per LB. Not free.
A new failure mode — the LB itself. Mitigate with multiple LB instances, DNS round-robin between them, and anycast IPs.
Stickiness headaches — if your backends hold session state in memory, the LB has to route the same user to the same server (cookie-based sticky sessions). This defeats the point of a fleet. Fix the stickiness at the backend layer — make services stateless and put session state in Redis.
You can skip a load balancer when:
Traffic is low enough that one server handles it with room to spare and you don't need HA.
The clients are inside your VPC and you can use client-side load balancing — the client itself has the backend list (via service discovery) and picks one. This is how gRPC load balancing and Netflix's Ribbon work. No dedicated LB in the path.
DNS round-robin is enough (e.g., read-only static content, accept slow failover).
06
Deep dive — Why Power of Two Choices wins
Imagine 1000 requests arriving at 100 backends. With random placement, the most-loaded backend ends up with about log N / log log N ≈ 5–6 requests when the average is 10. That 50–60% overload on the hot backend is the tail latency you see.
With round robin, load is perfectly balanced if every request costs the same. They don't. One slow query on server 3 makes round robin keep sending it more work.
With least connections, the LB tracks the exact load of every backend. That's O(N) state per decision, and requires constant updates — expensive at scale.
Power of Two Choices (Mitzenmacher, 1996): pick any two backends at random, send the request to the less-loaded of the two. With almost no state, the max load drops from log N / log log N to log log N / log 2 — for N=100, that's ~3.3, nearly half of random. The math is surprising and it's called the "power of two" exactly because two choices already captures most of the gain; going to three barely helps.
Max Load vs StrategyMermaid.js
flowchart LR
A[Random max ≈ log N / log log N] --> B[Power of Two max ≈ log log N]
B --> C[Perfect Round Robin max = ceiling of avg]
style A fill:#fce8e6,stroke:#b03a2e
style B fill:#e4f5ec,stroke:#1a5c38
style C fill:#e4f5ec,stroke:#1a5c38
Real systems go one step further — P2C + EWMA (exponentially-weighted moving average of response time). Instead of "connection count," each backend tracks its recent latency. The LB picks two random backends and routes to the one with lower EWMA. This is what Envoy's LEAST_REQUEST with v2_least_request_lb_config.choice_count=2 actually does.
Interview answer
"Default to L7 load balancing with Power of Two Choices. It's within a few percent of optimal, requires no per-backend state, and degrades gracefully when the fleet is heterogeneous."
Max Load — 1000 Requests over 100 BackendsSVG
07
Real-world systems
AWS ALB
Managed L7 load balancer
Default choice inside AWS. Round robin + least outstanding requests. Terminates TLS, integrates with Cognito for auth, and handles path/host-based routing to multiple target groups.
AWS NLB
Managed L4 load balancer
For TCP/UDP (gaming, VoIP, non-HTTP). Handles millions of RPS with single-digit millisecond overhead. Preserves client IP via proxy protocol.
HAProxy
Open-source, battle-tested
Runs the vast majority of the world's non-cloud L4/L7 load balancing. Config-file driven, single binary, microsecond latency.
Nginx
Reverse proxy + load balancer
Dual-purpose: serves static content AND load-balances to upstreams. Common in front of application servers as an L7 LB.
Envoy
Modern service-mesh data plane
What Istio, Consul Connect, and Lyft (who built it) use. First-class P2C support, gRPC-native, xDS API for dynamic config. The best-of-breed L7 LB in 2025.
Cloudflare
Global L7 at the edge
Load-balancing across your origin servers, usually with DNS steering + health checks. Anycast means the LB runs in ~300 PoPs simultaneously.
08
Used in problems
The sidebar shows every problem page in this portfolio where a load balancer sits in the architecture. Nearly every high-traffic system uses one — the exceptions are single-tenant admin tools and batch pipelines.