Serving GPT-class LLMs is unlike serving any model that came before. A single inference can take 30 seconds, generate thousands of tokens, require tens of GB of GPU memory, and cost $0.001-0.10 per request. Multiply by 10k QPS and a naive serving stack burns hundreds of GPUs while still hitting OOM errors and 10-second tail latencies.
LLM serving infrastructure is the specialized stack (vLLM, TensorRT-LLM, SGLang, Triton) that solves this. Its core techniques are continuous batching, KV cache management, token streaming, paged attention, and speculative decoding. Most of them squeeze more throughput from the same GPU; streaming improves perceived latency instead. It is the newest area in production ML and the biggest gap in most engineering teams' knowledge.
02
What's different about LLM inference
Variable-length output. One request might generate 10 tokens; another 4000. Batch sizes are unpredictable.
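Autoregressive generation. Token N depends on tokens 1..N-1, so the output tokens of one request must be generated sequentially; only the prompt can be processed in parallel.
Massive KV cache. Every token, prompt and generated, caches its key/value tensors, so the cache grows linearly with sequence length. For a 70B-class model with full multi-head attention in FP16, a 4k-token sequence holds ~10 GB of GPU memory just in cache; grouped-query attention (as in Llama 2 70B) cuts that to roughly 1-1.5 GB.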
Token streaming. Users want tokens as they're generated, not after the full response. SSE or HTTP chunked responses, not standard JSON.
GPU underutilization. Decode phase is memory-bandwidth-bound, not FLOP-bound. GPUs sit at 30-40% utilization without specialized batching.
None of these problems exist in classic CNN/recommendation serving. LLMs broke every assumption.
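The ~10 GB KV cache figure can be sanity-checked with a few lines of arithmetic. The sketch below assumes a 70B-class shape (80 layers, head dimension 128, FP16 values, 64 attention heads with either 64 or 8 KV heads); those values are assumptions chosen for illustration, and `kv_cache_bytes` is a made-up helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


GIB = 1024 ** 3

# Assumed 70B-class shape: 80 layers, head_dim 128, FP16 (2 bytes per value).
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, num_tokens=4096)
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=4096)

print(f"full multi-head attention: {mha / GIB:.2f} GiB")
print(f"grouped-query attention:   {gqa / GIB:.2f} GiB")
```

Under these assumptions the per-request cache differs by 8× depending on the number of KV heads, which is why the attention variant matters as much as the parameter count when sizing GPU memory.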
03
The four key optimizations
1. Continuous batching. Old batching: collect 32 requests, pad to longest, run together. With variable-length outputs, the 31 short requests wait for the one long one. Continuous batching: when one request finishes, immediately swap in the next from a queue. GPU never idle. 2-3× throughput.
2. Paged attention (vLLM). KV cache was traditionally allocated as one contiguous block per request, which is wasteful: you reserve for the max length and use less. Paged attention treats the KV cache like virtual memory: fixed-size blocks (pages), each holding the keys and values for a few tokens, are allocated on demand as the sequence grows. ~50% memory savings; supports more concurrent requests.
3. Token streaming (SSE). Server pushes tokens as generated. User sees output flowing word-by-word instead of waiting full duration. UX hugely better; same total compute.
4. Speculative decoding. Run a small "draft" model 5 tokens ahead; the large model verifies all 5 in a single forward pass. If the draft was right (often), those 5 tokens cost one large-model step instead of 5; at the first wrong token, the rest of the draft is discarded. Cost: extra small-model compute. Net win: 1.5-3× speedup.
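Speculative decoding (item 4) is easier to follow in code. The following is a minimal sketch of the greedy variant, assuming both models are plain functions that map a token list to the next token; `draft_next`, `target_next`, and the toy models are hypothetical stand-ins, and a real server would verify all draft positions in one batched forward pass rather than in a Python loop.

```python
def speculative_step(prefix, draft_next, target_next, k=5):
    """One round of greedy speculative decoding; returns the tokens accepted this round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks every proposed position. A real server does
    #    this in ONE forward pass; the loop here only simulates that pass.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the same pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt, draft_next, target_next, num_tokens, k=5):
    """Generate num_tokens tokens and count how many target passes were needed."""
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        target_passes += 1
    return out[: len(prompt) + num_tokens], target_passes


# Toy models: the target counts modulo 10; the draft agrees except after a 7.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, passes = generate([0], draft_next, target_next, num_tokens=20)

baseline = [0]
for _ in range(20):
    baseline.append(target_next(baseline))

print(tokens == baseline, passes)
```

The output sequence is identical to what the target model would produce alone; only the number of expensive target passes changes. How large the saving is depends entirely on how often the draft agrees with the target.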
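~30 ms: per-token decode (Llama-70B on H100)
~100 GB: model weights + KV cache for 70B, assuming 8-bit weights (FP16 weights alone are ~140 GB)
~10 GB: KV cache per 4k-token request (FP16, full multi-head attention; less with grouped-query attention)
2-5×: vLLM throughput vs naive serving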
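Much of the throughput gap in the last figure comes from continuous batching. A toy simulation makes the effect visible; it is a sketch that counts decode steps only, assuming every request emits at least one token and that each step costs the same regardless of batch occupancy. The function names and the workload are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes; short ones idle in their slots."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A slot is refilled from the queue as soon as its request finishes."""
    queue = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode step advances every in-flight request by one token
        # Requests with one token left have just finished; drop them.
        running = [r - 1 for r in running if r > 1]
    return steps


# Made-up workload: one long request mixed in with three short ones, repeated.
lens = [100, 5, 5, 5] * 4
static = static_batching_steps(lens, batch_size=4)
continuous = continuous_batching_steps(lens, batch_size=4)
print(static, continuous)
```

With this made-up workload the static schedule needs 400 decode steps, while the continuous one finishes in 130. Real gains depend on the length distribution; uniform output lengths would show no difference at all.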
04
Deep dive — prefill vs decode and why they need different infra
An LLM inference splits into two phases:
Prefill: process the entire input prompt to populate the KV cache. Compute-heavy (matmul-bound), parallelizable across all input tokens. ~50-200ms for a typical prompt. Saturates GPU compute.
Decode: generate output tokens one at a time. Memory-bandwidth-bound (each step reads the model weights and the entire KV cache to produce a single token). 10-30ms per token. GPU compute mostly idle.
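The memory-bandwidth bound on decode can be estimated directly: a step cannot finish faster than the time it takes to read the weights and the KV cache once. The sketch below is a back-of-the-envelope lower bound under assumed inputs (8-bit weights for a 70B model, the ~10 GB per-request cache quoted earlier, and roughly 3.35 TB/s of HBM bandwidth); all three numbers and the helper name are assumptions, not measurements.

```python
def decode_step_lower_bound_ms(weight_bytes, kv_bytes, mem_bandwidth_bytes_per_s):
    """Lower bound on one decode step: all weights and cached K/V are read once."""
    return (weight_bytes + kv_bytes) / mem_bandwidth_bytes_per_s * 1e3


# Hypothetical inputs, for illustration only.
weights = 70e9 * 1      # 70B parameters at 1 byte each (8-bit weights)
kv = 10e9               # ~10 GB KV cache for one long request
bandwidth = 3.35e12     # ~3.35 TB/s, an assumed HBM bandwidth figure

# Batch of 1: the whole read is paid for a single token.
t1 = decode_step_lower_bound_ms(weights, kv, bandwidth)

# Batch of 16: weights are read once, but there are 16 KV caches to read.
t16 = decode_step_lower_bound_ms(weights, 16 * kv, bandwidth)

print(f"batch 1:  {t1:.1f} ms per step, {t1:.1f} ms per token")
print(f"batch 16: {t16:.1f} ms per step, {t16 / 16:.1f} ms per token")
```

The weight read is amortized across the batch while the KV read is not, which is why batching helps decode so much and why KV cache size eventually caps the useful batch size.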
These phases have opposite bottlenecks. Mixing them on the same GPU is suboptimal:
One long prefill stalls all decodes happening on the same GPU.
Memory-bound decode underutilizes GPU compute.
Disaggregated serving (the 2024 SOTA): dedicate one fleet of GPUs to prefill and another to decode. KV cache transfers between them via high-bandwidth interconnect. Each fleet runs at its bottleneck. ~2-3× throughput vs co-located. DistServe, Splitwise papers; productized by major cloud providers.
Interview answer
"We serve LLMs with vLLM on H100s. Continuous batching + paged attention give ~3× throughput vs HuggingFace transformers. Token streaming via SSE for UX. At higher scale we'd disaggregate prefill and decode onto separate GPU pools to optimize their different bottlenecks."
05
Real-world serving stacks
vLLM
Open-source, dominant
Berkeley research → most-used OSS LLM server. Paged attention + continuous batching. Used by Vercel and hundreds of startups.
TensorRT-LLM
NVIDIA-optimized
Hand-tuned kernels for NVIDIA GPUs. Best raw performance on H100/A100. Tighter integration; less flexibility.
SGLang
Newer, structured outputs
Adds caching of common prefixes (system prompts), structured JSON output. Strong for tool-using agents.
Anthropic / OpenAI / Google internal
Custom infrastructure
The big foundation-model labs run proprietary serving stacks. Disaggregated, multi-region, sub-second p99 at trillion-token scale.
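The paged attention idea behind vLLM can be pictured as a block table, much like a page table in an operating system. The class below is a toy sketch of that bookkeeping only, not vLLM's implementation: `PagedKVAllocator` and its methods are hypothetical names, the block size of 16 tokens is an assumption, and no actual tensors are stored.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; take a new block only when the last one is full."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller should queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def physical_slot(self, seq_id, token_idx):
        """Translate a logical token position into (physical block, offset in block)."""
        block = self.block_tables[seq_id][token_idx // self.block_size]
        return block, token_idx % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block

print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]), len(alloc.free_blocks))
alloc.free("a")
print(len(alloc.free_blocks))
```

Because blocks are handed out only as a sequence grows, the waste per request is at most one partially filled block, instead of the gap between the reserved maximum length and the actual length.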
06
When to self-host vs use API
Use the API (OpenAI, Anthropic, Google): low to moderate volume. ~$1-15 per million tokens. Zero ops. Always get the latest model. Default for < 1B tokens/day.
Self-host with vLLM/TensorRT-LLM:
Volume crossover point: ~10-100B tokens/day depending on model size.
Data sensitivity: data can't leave your network.
Custom fine-tuned model: APIs limit fine-tuning, and you already have your own weights.
Catches: GPUs are scarce + expensive; ops complexity is significant; staying current with model releases is engineering work. Most teams underestimate the operational tax.
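The volume crossover can be framed as a simple break-even calculation. The sketch below is one way to set it up, assuming fully utilized GPUs and a flat daily operations cost; the function name and every input value in the usage lines are hypothetical placeholders, so the result says nothing about real prices.

```python
def breakeven_tokens_per_day(api_price_per_m, gpu_cost_per_hour,
                             tokens_per_sec_per_gpu, fixed_ops_cost_per_day):
    """Daily token volume above which self-hosting is cheaper, or None if it never is."""
    tokens_per_gpu_day = tokens_per_sec_per_gpu * 86_400
    # Marginal self-hosting price per million tokens, assuming GPUs stay busy.
    self_price_per_m = gpu_cost_per_hour * 24 / tokens_per_gpu_day * 1e6
    if self_price_per_m >= api_price_per_m:
        return None  # the API is cheaper per token at any volume
    # Fixed ops cost is recovered by the per-token saving.
    return fixed_ops_cost_per_day / (api_price_per_m - self_price_per_m) * 1e6


# Placeholder inputs; substitute your own quotes and measured throughput.
volume = breakeven_tokens_per_day(
    api_price_per_m=5.0,          # $ per million tokens (placeholder)
    gpu_cost_per_hour=2.0,        # $ per GPU-hour (placeholder)
    tokens_per_sec_per_gpu=2000,  # sustained throughput (placeholder)
    fixed_ops_cost_per_day=5000,  # engineers, on-call, tooling (placeholder)
)
print(f"break-even: {volume:.3g} tokens/day")
```

The result is very sensitive to utilization and to the fixed operations cost, which is the "operational tax" mentioned above; halving utilization doubles the effective self-hosted price per token.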
07
Used in problems
Typeahead with semantic completions calls an LLM per keystroke — needs sub-100ms p99 + token streaming. Recommendation systems use LLM-generated explanations in cards. News feed uses LLMs for content moderation, summary generation. Any modern chat-shaped product depends on a serving stack like this.
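Token streaming over SSE, which several of these products rely on, comes down to a simple wire format: each event is a `data:` line followed by a blank line. The generator below is a minimal sketch of that framing, assuming tokens arrive from some iterator; the JSON payload shape and the `[DONE]` sentinel are conventions chosen for illustration, not part of the SSE format itself.

```python
import json


def sse_events(token_iter):
    """Wrap each generated token in a Server-Sent Events frame."""
    for tok in token_iter:
        # JSON-encoding keeps newlines inside a token from breaking the framing.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    # End-of-stream marker; a common convention, assumed here.
    yield "data: [DONE]\n\n"


frames = list(sse_events(["Hel", "lo", "\n"]))
for frame in frames:
    print(repr(frame))
```

In a web framework these strings would be written to a response with content type `text/event-stream`, flushing after each frame so the client sees tokens as they are produced.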