Load Balancer

01

The problem it solves

A single server can serve maybe a few thousand requests per second before CPU, memory, or connection limits pin it. The moment your traffic crosses that ceiling, you have two options: buy a much bigger server (vertical scaling — expensive, has a hard ceiling, single point of failure) or put many cheap servers behind something that spreads traffic across them. That "something" is a load balancer.

Without a load balancer, horizontal scaling is painful. Clients would need to know every backend server by IP, pick one, and retry when it's down. The load balancer hides the fleet behind one stable address and deals with failure, capacity, and routing on the client's behalf.

What it actually does

Terminates client connections, picks a healthy backend, forwards the request, and returns the response. Bonus: it pulls dead servers out of rotation within seconds and handles TLS termination so backends can speak plain HTTP internally.

02

Intuition

Picture a grocery store on Sunday afternoon. Twelve checkout lanes are open. You don't walk in and pick lane 7 because you memorised it — you look for the shortest queue, or a greeter directs you to the next free cashier. If a cash register breaks, the greeter stops sending you there. If the store gets busier, they open more lanes.

The greeter is a load balancer. The cashiers are your backend servers. The shoppers are your requests. The whole point: the shoppers don't need to know which cashiers exist — they just enter the store and get served.

Figure: Traffic Spreading

03

How it works

A request arrives at the load balancer's IP. The LB does four things:

Accepts the connection (completing the TCP three-way handshake) and optionally terminates TLS so it can read HTTP headers.
Picks a backend using a chosen algorithm (see §4) — round robin, least connections, IP hash, etc.
Opens or reuses a connection to that backend and forwards bytes. The client has no idea which backend answered.
Watches health continuously: every healthCheckInterval (often 2–5s), the LB pings each backend on a known path like /health. Three failed checks in a row → the backend is yanked out of rotation. The next time a check succeeds, it's put back.

There's a subtle detail here: the health checker, not the request router, decides fleet membership. The router only picks from backends the health checker has marked healthy. This decoupling is why a load balancer can pull a crashed server after three missed checks (roughly 6–15 seconds at a 2–5s interval) without the router needing to know anything about crashes.
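The split between health checker and router can be sketched in a few lines. This is a minimal illustration, not any particular product's implementation: the names `Backend`, `HealthChecker`, and `worst_case_detection_seconds` are made up for the example, and the probe itself (the HTTP call to /health) is left out so the membership logic stays visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    healthy: bool = True
    consecutive_failures: int = 0


@dataclass
class HealthChecker:
    """Owns pool membership; the router only ever reads healthy_backends()."""
    backends: List[Backend]
    failure_threshold: int = 3  # "three failed checks in a row"

    def record(self, backend: Backend, check_passed: bool) -> None:
        """Apply the result of one probe to a backend's state."""
        if check_passed:
            # A single success puts the backend back in rotation.
            backend.consecutive_failures = 0
            backend.healthy = True
        else:
            backend.consecutive_failures += 1
            if backend.consecutive_failures >= self.failure_threshold:
                backend.healthy = False

    def healthy_backends(self) -> List[Backend]:
        return [b for b in self.backends if b.healthy]


def worst_case_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Rough upper bound from crash to removal, ignoring probe timeouts."""
    return interval_s * failure_threshold
```

With a 5-second interval and a threshold of 3, `worst_case_detection_seconds(5, 3)` gives 15 seconds, which is why tighter intervals matter if you need fast removal.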

Request Lifecycle Mermaid.js

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant HC as Health Checker
    participant S1 as Server 1
    participant S2 as Server 2
    Note over HC,S2: Every 3s, running in background
    HC->>S1: GET /health
    S1-->>HC: 200 OK
    HC->>S2: GET /health
    S2-->>HC: timeout
    Note over HC: S2 marked unhealthy after 3 fails
    C->>LB: POST /api/order
    Note over LB: Pick from {S1} only
    LB->>S1: Forward request
    S1-->>LB: 201 Created
    LB-->>C: 201 Created
    Note over HC,S2: S2 comes back up
    HC->>S2: GET /health
    S2-->>HC: 200 OK
    Note over HC: S2 re-added to pool

~100μs

LB overhead per request

2–5s

health check interval

3×

failures before removal

30–60s

drain time (in-flight requests)

04

Variants — Layer 4 vs Layer 7, and the algorithm menu

The single biggest decision is what layer you operate at. Everything else flows from it.

Layer 4 · Transport

Forwards TCP/UDP packets

The LB sees the 4-tuple (src IP, src port, dst IP, dst port) and picks a backend once per connection. It doesn't read the HTTP request; it just shovels bytes. Pros: very fast, handles any protocol, fewer moving parts. Cons: can't route by URL path, doesn't decrypt TLS when running in pure pass-through mode, and can't rewrite requests (for example, to add a trailing /).
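A rough sketch of "pick once per connection" is shown here. The `L4Balancer` class and its flow table are hypothetical and only illustrate the idea; real L4 balancers do this in the kernel or in hardware, not in Python.

```python
import hashlib
from typing import Dict, List, Tuple

FourTuple = Tuple[str, int, str, int]  # (src IP, src port, dst IP, dst port)


class L4Balancer:
    def __init__(self, backends: List[str]) -> None:
        self.backends = backends
        self.flows: Dict[FourTuple, str] = {}  # connection -> pinned backend

    def route(self, flow: FourTuple) -> str:
        """Choose a backend for a new connection; reuse it for later packets."""
        if flow not in self.flows:
            digest = hashlib.sha256(repr(flow).encode()).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.backends)
            self.flows[flow] = self.backends[index]
        return self.flows[flow]

    def close(self, flow: FourTuple) -> None:
        """Forget the pinning once the connection ends."""
        self.flows.pop(flow, None)
```

Nothing in `route` looks at the payload, which is exactly why this layer works for any protocol and also why it cannot route by path.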

Layer 7 · Application

Understands HTTP (the usual choice)

The LB parses the request line and headers, so it can route /api/* to one service and /images/* to another, do sticky sessions by cookie, add auth headers, terminate TLS, compress responses. Pros: content-aware routing is a superpower. Cons: higher CPU cost; you're on the critical path for every byte.
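Content-aware routing boils down to a lookup on the parsed request. The sketch below assumes a made-up routing table (`ROUTES`, `DEFAULT_POOL`) and a made-up cookie name (`lb_backend`); real proxies express the same idea in their own config languages.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical routing table: path prefix -> pool of backends.
ROUTES: List[Tuple[str, List[str]]] = [
    ("/api/", ["api-1", "api-2"]),
    ("/images/", ["img-1"]),
]
DEFAULT_POOL: List[str] = ["web-1", "web-2"]


def pool_for(path: str) -> List[str]:
    """Return the backend pool whose prefix matches the request path."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


def sticky_backend(pool: List[str], cookies: Dict[str, str]) -> Optional[str]:
    """Honor a sticky-session cookie if it names a backend still in the pool."""
    wanted = cookies.get("lb_backend")
    return wanted if wanted in pool else None
```

A request for `/api/orders` lands in the API pool, and a cookie naming a backend that has left the pool is ignored so the normal algorithm can pick a replacement.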

Once you've picked a layer, you pick an algorithm for distributing load across backends:

Algorithm	How it picks	Best for	Watch out for
Round Robin	Rotates: 1 → 2 → 3 → 1 → …	Stateless services, uniform request cost	Ignores that request #42 was a 10-second report — the slow backend gets the next request too
Least Connections	Sends to the backend with fewest open connections right now	Long-lived connections (WebSockets, DB proxies), variable request cost	"Connection count ≠ work" — a backend idle on 5 long polls looks busier than one grinding CPU on 2
IP Hash	`hash(client_ip) % N` picks the backend	Poor-man's sticky sessions without cookies	Rebalances wildly when N changes (use consistent hashing)
Weighted RR	Bigger servers get more turns — e.g., 4:2:1	Mixed fleet (some old boxes, some new)	Manual weights drift out of date as hardware ages
Power of Two Choices (P2C)	Pick 2 random backends, send to the one with fewer active requests	Most modern systems; near-optimal with almost no state	Each LB instance only sees its own in-flight counts. Envoy's LEAST_REQUEST policy works this way, but note that Envoy and AWS ALB both default to round robin.
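The first four rows of the table fit in a few lines each. These are illustrative sketches with invented function names, not production pickers; the weighted version in particular uses naive list expansion rather than the smoother interleaving real proxies implement.

```python
import hashlib
import itertools
from typing import Dict, Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Rotate through the backends forever."""
    return itertools.cycle(backends)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Naive expansion: a backend with weight 4 appears 4 times per cycle."""
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_connections(open_conns: Dict[str, int]) -> str:
    """Return the backend with the fewest open connections right now."""
    return min(open_conns, key=open_conns.get)


def ip_hash(client_ip: str, backends: List[str]) -> str:
    """hash(client_ip) % N, using a stable hash so results survive restarts."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def remapped_fraction(ips: List[str], before: List[str], after: List[str]) -> float:
    """Share of clients whose backend changes when the pool changes."""
    moved = sum(ip_hash(ip, before) != ip_hash(ip, after) for ip in ips)
    return moved / len(ips)
```

`remapped_fraction` makes the IP-hash warning concrete: growing a pool from 4 to 5 backends keeps a client in place only when its hash gives the same remainder mod 4 and mod 5, so with a uniform hash about 80% of clients are expected to move. Consistent hashing exists to shrink that number.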

Default to this

For most HTTP APIs: L7 load balancing + Power-of-Two-Choices. It handles variable request cost gracefully, routes by path, and doesn't need the tuning of RR or least-conn.

05

Tradeoffs — when NOT to reach for a load balancer

Load balancers buy you scale and availability, but they cost:

Latency — an extra network hop (~0.1–1ms inside a region, worse across AZs).
Cost — an AWS ALB at 1M requests/hour runs roughly $20/day per LB. Not free.
A new failure mode — the LB itself. Mitigate with multiple LB instances, DNS round-robin between them, and anycast IPs.
Stickiness headaches — if your backends hold session state in memory, the LB has to route the same user to the same server (cookie-based sticky sessions). This defeats the point of a fleet. Fix the stickiness at the backend layer — make services stateless and put session state in Redis.

You can skip a load balancer when:

Traffic is low enough that one server handles it with room to spare and you don't need HA.
The clients are inside your VPC and you can use client-side load balancing — the client itself has the backend list (via service discovery) and picks one. This is how gRPC load balancing and Netflix's Ribbon work. No dedicated LB in the path.
DNS round-robin is enough (e.g., read-only static content, accept slow failover).
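For the client-side option, the picking and retry logic moves into the caller. The following is a minimal sketch under stated assumptions: `call_with_client_side_lb` and its `send` callback are hypothetical, the backend list is assumed to come from service discovery, and real libraries add health tracking, backoff, and smarter picking on top.

```python
import random
from typing import Callable, List, Optional


def call_with_client_side_lb(
    backends: List[str],
    send: Callable[[str], str],
    max_attempts: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a backend locally and fail over to another on connection errors."""
    rng = rng or random.Random()
    candidates = backends[:]
    rng.shuffle(candidates)  # random order spreads load across callers
    last_error: Optional[Exception] = None
    for backend in candidates[:max_attempts]:
        try:
            return send(backend)
        except ConnectionError as exc:
            last_error = exc  # try the next backend in the shuffled list
    raise RuntimeError("all attempted backends failed") from last_error
```

The tradeoff is that every client now carries this logic and needs a fresh backend list, which is manageable inside one organization's network and hard to impose on public clients.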

06

Deep dive — Why Power of Two Choices wins

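Imagine 1000 requests arriving at 100 backends, an average of 10 each. With purely random placement the hottest backend does not stop at 10: it typically ends up with roughly 18–20 requests, close to double the average. (The often-quoted log N / log log N figure describes the simpler case of N requests thrown at N backends, where the average is 1.) That overload on the hot backend is the tail latency you see.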

With round robin, load is perfectly balanced if every request costs the same. They don't. One slow query on server 3 makes round robin keep sending it more work.

With least connections, the LB tracks the exact load of every backend. Finding the minimum means an O(N) scan per decision, or an ordered structure that must be updated every time a request starts or finishes, which gets expensive at scale.

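Power of Two Choices (Mitzenmacher, 1996): pick any two backends at random, send the request to the less-loaded of the two. With almost no state, the max load in the N-requests-into-N-backends model drops from about log N / log log N to about log log N / log 2 plus a constant, an exponential improvement in how fast the worst case grows with N. For N = 100 (natural logs) the leading terms are about 3.0 and 2.2, and the gap widens as the fleet grows. The math is surprising, and it's called the "power of two" because two choices already capture most of the gain: d choices give log log N / log d, so going to three only improves the constant.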
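The claim is easy to check empirically. The simulation below is a sketch under simplifying assumptions: requests never finish (so "load" is just a running count), candidates are sampled with replacement, and the function names are made up for this example. Setting `choices=1` gives pure random placement.

```python
import random
from typing import List


def max_load(requests: int, backends: int, choices: int, rng: random.Random) -> int:
    """Place each request on the least-loaded of `choices` random backends."""
    load: List[int] = [0] * backends
    for _ in range(requests):
        sampled = [rng.randrange(backends) for _ in range(choices)]
        target = min(sampled, key=lambda i: load[i])
        load[target] += 1
    return max(load)


def average_max_load(requests: int, backends: int, choices: int,
                     trials: int = 200, seed: int = 0) -> float:
    """Average the hottest backend's load over several independent runs."""
    rng = random.Random(seed)
    total = sum(max_load(requests, backends, choices, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(f"choices={d}: average max load = {average_max_load(1000, 100, d):.2f}")
```

For 1000 requests over 100 backends, expect one choice to leave the hottest backend far above the average of 10, two choices to pull it to within a few requests of the average, and three choices to add only a small further improvement.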

Max Load vs Strategy Mermaid.js

flowchart LR
    A["Random<br/>max ≈ log N / log log N"] --> B["Power of Two<br/>max ≈ log log N"]
    B --> C["Perfect Round Robin<br/>max = ceiling of avg"]
    style A fill:#fce8e6,stroke:#b03a2e
    style B fill:#e4f5ec,stroke:#1a5c38
    style C fill:#e4f5ec,stroke:#1a5c38

Real systems go one step further: P2C + EWMA (exponentially-weighted moving average of response time). Instead of "connection count," each backend tracks its recent latency. The LB picks two random backends and routes to the one with lower EWMA. Finagle and Linkerd ship this as P2C with peak EWMA. Envoy's LEAST_REQUEST policy, whose least_request_lb_config.choice_count defaults to 2, is the plain P2C variant: it compares active request counts rather than latency.
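A minimal sketch of the latency-weighted variant follows. The class name `EwmaP2CPicker`, the smoothing factor `alpha=0.3`, and the zero starting value are all assumptions made for illustration; production implementations also decay stale averages over time and penalize in-flight requests, which this sketch leaves out.

```python
import random
from typing import Dict, List, Optional


class EwmaP2CPicker:
    def __init__(self, backends: List[str], alpha: float = 0.3,
                 rng: Optional[random.Random] = None) -> None:
        if len(backends) < 2:
            raise ValueError("P2C needs at least two backends")
        self.alpha = alpha  # weight of the newest sample (assumed value)
        # Starting at 0.0 means unobserved backends look fast and get tried early.
        self.ewma_ms: Dict[str, float] = {b: 0.0 for b in backends}
        self.rng = rng or random.Random()

    def observe(self, backend: str, latency_ms: float) -> None:
        """Fold a completed request's latency into the backend's moving average."""
        old = self.ewma_ms[backend]
        self.ewma_ms[backend] = self.alpha * latency_ms + (1 - self.alpha) * old

    def pick(self) -> str:
        """Sample two distinct backends and return the one with the lower EWMA."""
        a, b = self.rng.sample(list(self.ewma_ms), 2)
        return a if self.ewma_ms[a] <= self.ewma_ms[b] else b
```

Because every pair contains at least one other backend, a backend whose average is the highest in the pool is never selected until its average drops, which is the behavior you want for a server that has started responding slowly.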

Interview answer

"Default to L7 load balancing with Power of Two Choices. It's close to optimal in practice, needs only a per-backend count of in-flight requests, and degrades gracefully when the fleet is heterogeneous."

Figure: Max Load, 1000 Requests over 100 Backends

07

Real-world systems

AWS ALB

Managed L7 load balancer

Default choice inside AWS. Round robin + least outstanding requests. Terminates TLS, integrates with Cognito for auth, and handles path/host-based routing to multiple target groups.

AWS NLB

Managed L4 load balancer

For TCP/UDP (gaming, VoIP, non-HTTP). Handles millions of RPS with single-digit millisecond overhead. Preserves the client IP natively for instance targets, or via Proxy Protocol v2 where that isn't available.

HAProxy

Open-source, battle-tested

Widely deployed for self-hosted L4/L7 load balancing. Config-file driven, single binary, and adds very little latency per request.

Nginx

Reverse proxy + load balancer

Dual-purpose: serves static content AND load-balances to upstreams. Common in front of application servers as an L7 LB.

Envoy

Modern service-mesh data plane

What Istio, Consul Connect, and Lyft (who built it) use. First-class P2C support, gRPC-native, xDS API for dynamic config. The best-of-breed L7 LB in 2025.

Cloudflare

Global L7 at the edge

Load-balancing across your origin servers, usually with DNS steering + health checks. Anycast means the LB runs in ~300 PoPs simultaneously.

08

Used in problems

The sidebar shows every problem page in this portfolio where a load balancer sits in the architecture. Nearly every high-traffic system uses one — the exceptions are single-tenant admin tools and batch pipelines.
