Concept · Machine Learning Systems

LLM Serving Infrastructure

01

Why this matters

Serving GPT-class LLMs is unlike serving any model that came before. A single inference can take 30 seconds, generate thousands of tokens, require tens of GB of GPU memory, and cost $0.001-0.10 per request. Multiply by 10k QPS and a naive serving stack burns hundreds of GPUs while still hitting OOM errors and 10-second tail latencies.

LLM serving infrastructure is the specialized stack (vLLM, TensorRT-LLM, SGLang, Triton) that solves this. Continuous batching, KV cache management, paged attention, and speculative decoding each squeeze more throughput from the same GPU; token streaming cuts perceived latency. Newest area in production ML; biggest gap in most engineering teams' knowledge.

02

What's different about LLM inference

  • Variable-length output. One request might generate 10 tokens; another 4000. Batch sizes are unpredictable.
  • Autoregressive generation. Token N depends on tokens 1..N-1. Can't parallelize within one request — must generate sequentially.
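  • Massive KV cache. Every token, prompt or generated, caches its key/value tensors in each layer, so the cache grows linearly with sequence length. For a 70B-class model with full multi-head attention in fp16, a 4k-token sequence holds ~10 GB of GPU memory in cache alone; grouped-query attention cuts this several-fold.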
  • Token streaming. Users want tokens as they're generated, not after the full response. SSE or HTTP chunked responses, not standard JSON.
  • GPU underutilization. Decode phase is memory-bandwidth-bound, not FLOP-bound. GPUs sit at 30-40% utilization without specialized batching.

None of these problems exist in classic CNN/recommendation serving. LLMs broke every assumption.
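The ~10 GB KV cache figure can be sanity-checked with simple arithmetic. The sketch below assumes a 70B-class shape (80 layers, 64 attention heads of dimension 128, fp16 values); these dimensions are illustrative assumptions, not taken from a specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: keys and values for every layer."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Assumed 70B-class shape with full multi-head attention (64 KV heads).
full_mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
# Same shape, but grouped-query attention sharing 8 KV heads (assumed).
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"full MHA: {full_mha / GIB:.1f} GiB")       # full MHA: 10.0 GiB
print(f"GQA (8 KV heads): {gqa / GIB:.2f} GiB")    # GQA (8 KV heads): 1.25 GiB
```

Under these assumptions the per-request cache depends heavily on the attention layout, which is why the number of concurrent requests a GPU can hold varies so much between models.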

03

The four key optimizations

1. Continuous batching. Static batching: collect 32 requests, pad to the longest, run them together. With variable-length outputs, the 31 short requests wait for the one long one. Continuous batching schedules at every decode step: when one request finishes, the next one from the queue immediately takes its slot. The GPU never idles on finished requests. 2-3× throughput.
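A toy simulation makes the difference concrete. It counts decode steps only and assumes every step costs the same regardless of batch occupancy, which is a simplification; the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    queue = deque(output_lens)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill free slots from the queue before every decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # One step emits one token per running request; drop finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: mostly short outputs with two long ones.
lens = [10, 10, 10, 400, 10, 10, 10, 400]
print(static_batching_steps(lens, batch_size=4))      # 800
print(continuous_batching_steps(lens, batch_size=4))  # 420
```

In this workload the static scheduler pays for the long request twice, while the continuous scheduler overlaps the two long requests and finishes in roughly half the steps.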

2. Paged attention (vLLM). The KV cache was traditionally allocated as one contiguous block per request, sized for the maximum length and mostly unused. Paged attention treats the KV cache like virtual memory: fixed-size blocks of cache are allocated on demand as the sequence grows, and a block table maps each sequence to its blocks. ~50% memory savings; supports more concurrent requests.
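The bookkeeping behind this idea can be sketched as a small block allocator. This is a simplified illustration of the concept, not vLLM's actual implementation; the class name, method names, and block size of 16 are assumptions.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to fixed-size physical cache blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must preempt or queue")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")
print(len(alloc.block_tables["req-a"]))  # 2 blocks hold 17 tokens
```

At most one partially filled block is wasted per sequence, instead of the unused tail of a max-length reservation.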

3. Token streaming (SSE). Server pushes tokens as generated. User sees output flowing word-by-word instead of waiting full duration. UX hugely better; same total compute.
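On the wire, an SSE stream is a series of `data:` lines, each followed by a blank line. The sketch below shows only the framing; `fake_generate` stands in for a real model, and the JSON payload shape and `[DONE]` sentinel are assumptions for illustration, since the SSE format does not define an end-of-stream marker.

```python
import json


def sse_events(token_iter):
    """Wrap an iterator of token strings as Server-Sent Events frames."""
    for token in token_iter:
        payload = json.dumps({"token": token})
        # Each SSE event is a "data:" line terminated by a blank line.
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream sentinel chosen for this sketch.
    yield "data: [DONE]\n\n"


def fake_generate():
    """Stand-in for a model's token stream."""
    yield from ["Hello", ",", " world"]


frames = list(sse_events(fake_generate()))
print("".join(frames), end="")
```

A web framework would send these frames with the `text/event-stream` content type and flush after each one, so the client can render tokens as they arrive.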

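4. Speculative decoding. A small "draft" model runs 5 tokens ahead; the large model verifies all 5 in a single parallel pass. When the draft is right (often), those tokens cost one large-model step instead of five, up to 5× faster on that stretch. Cost: extra small-model compute, plus wasted work on rejected tokens. Net win: 1.5-3× speedup.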
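The accept/reject logic is easiest to see in the greedy case. The sketch below uses toy integer "models" as hypothetical stand-ins for real LLMs, and it loops over positions where a real system would verify all of them in one batched forward pass. Sampling-based variants use a probabilistic acceptance rule that is not shown here.

```python
def speculative_step(target_next, draft_next, context, k=5):
    """One round of greedy speculative decoding.

    Returns the tokens accepted this round. The result always matches what
    the target model would have produced on its own.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2. Target model predicts the next token at all k+1 positions.
    target = [target_next(context + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix where draft and target agree, then take
    #    the target's own token at the first mismatch (or one bonus token).
    accepted = []
    for i in range(k):
        if draft[i] != target[i]:
            break
        accepted.append(draft[i])
    accepted.append(target[len(accepted)])
    return accepted


# Toy models over integer tokens (hypothetical, for illustration only).
def target_next(ctx):
    return (ctx[-1] + 1) % 100


def draft_next(ctx):
    # Agrees with the target except when the last token is a multiple of 4.
    return 0 if ctx[-1] % 4 == 0 else (ctx[-1] + 1) % 100


print(speculative_step(target_next, draft_next, [1]))  # [2, 3, 4, 5]
```

Here one target pass yields four tokens: three accepted draft tokens plus the target's correction at the first mismatch. When the draft misses on the first token, the round still yields one token, so output quality is unchanged and only speed varies.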

  • ~30 ms: per-token decode (Llama-70B on H100)
  • ~100 GB: model weights + KV cache for 70B
  • ~10 GB: KV cache per 4k-token request
  • 2-5×: vLLM throughput vs naive serving

04

Deep dive — prefill vs decode and why they need different infra

An LLM inference splits into two phases:

Prefill: process the entire input prompt to populate the KV cache. Compute-heavy (matmul-bound), parallelizable across all input tokens. ~50-200ms for a typical prompt. Saturates GPU compute.

Decode: generate output tokens one at a time. Memory-bandwidth-bound (each step reads the model weights and the entire KV cache to produce a single token per request). 10-30ms per token. GPU compute mostly idle.

These phases have opposite bottlenecks. Mixing them on the same GPU is suboptimal:

  • One long prefill stalls all decodes happening on the same GPU.
  • Memory-bound decode underutilizes GPU compute.
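A back-of-the-envelope estimate shows why the two phases hit different limits. The hardware numbers below are round, assumed values for an aggregate serving setup, not measurements, and the weights are assumed to be stored at 8 bits per parameter.

```python
def decode_step_ms(weight_bytes, kv_bytes, mem_bandwidth_bytes_per_s):
    """Lower bound on one decode step: all weights and KV cache are read once."""
    return (weight_bytes + kv_bytes) / mem_bandwidth_bytes_per_s * 1e3


def prefill_ms(num_params, prompt_tokens, peak_flops):
    """Lower bound on prefill: about 2 FLOPs per parameter per prompt token."""
    return 2 * num_params * prompt_tokens / peak_flops * 1e3


# Illustrative, assumed hardware numbers (not measurements):
BANDWIDTH = 3.0e12   # bytes/s of aggregate memory bandwidth
PEAK_FLOPS = 1.0e15  # FLOP/s of aggregate dense compute

weights = 70e9 * 1   # 70B parameters at 1 byte each (8-bit weights assumed)
kv = 1.0e9           # roughly 1 GB of KV cache read per step (assumed)

print(f"decode step >= {decode_step_ms(weights, kv, BANDWIDTH):.1f} ms")
print(f"prefill (1k tokens) >= {prefill_ms(70e9, 1000, PEAK_FLOPS):.0f} ms")
```

Under these assumptions a decode step is bounded by memory traffic at roughly 24 ms, while a 1k-token prefill is bounded by compute at roughly 140 ms. Batching more requests into a decode step amortizes the weight reads, which is why batch size matters so much for decode and hardly at all for prefill.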

Disaggregated serving (the 2024 SOTA): dedicate one fleet of GPUs to prefill and another to decode. KV cache transfers between them via high-bandwidth interconnect. Each fleet runs at its bottleneck. ~2-3× throughput vs co-located. DistServe, Splitwise papers; productized by major cloud providers.

Interview answer

"We serve LLMs with vLLM on H100s. Continuous batching + paged attention give ~3× throughput vs HuggingFace transformers. Token streaming via SSE for UX. At higher scale we'd disaggregate prefill and decode onto separate GPU pools to optimize their different bottlenecks."

05

Real-world serving stacks

vLLM

Open-source, dominant

Started as a UC Berkeley research project and became the most-used OSS LLM server. Combines paged attention with continuous batching. Widely adopted by startups and platform teams.

TensorRT-LLM

NVIDIA-optimized

Hand-tuned kernels for NVIDIA GPUs. Best raw performance on H100/A100. Tighter integration; less flexibility.

SGLang

Newer, structured outputs

Adds caching of common prefixes (system prompts), structured JSON output. Strong for tool-using agents.

Anthropic / OpenAI / Google internal

Custom infrastructure

The big foundation-model labs run proprietary serving stacks. Disaggregated, multi-region, sub-second p99 at trillion-token scale.

06

When to self-host vs use API

Use the API (OpenAI, Anthropic, Google): low to moderate volume. ~$1-15 per million tokens. Zero ops. Always get the latest model. Default for < 1B tokens/day.

Self-host with vLLM/TensorRT-LLM:

  • Volume crossover point: ~10-100B tokens/day depending on model size.
  • Data sensitivity: data can't leave your network.
  • Custom fine-tuned model: hosted APIs restrict fine-tuning, and you already have your own weights.
  • Latency: dedicated GPUs avoid shared-tenant variability.

Catches: GPUs are scarce + expensive; ops complexity is significant; staying current with model releases is engineering work. Most teams underestimate the operational tax.

07

Used in problems

Typeahead with semantic completions calls an LLM per keystroke, so it needs sub-100ms p99 plus token streaming. Recommendation systems show LLM-generated explanations in cards. News feed uses LLMs for content moderation and summary generation. Any modern chat-shaped product depends on a serving stack like this.
