Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.
Out of 10 points. 45 min whiteboard.
01
Prompt
A user sends a prompt. The API gateway routes it through a tokenizer to a GPU cluster of ~10K A100s running autoregressive inference. Tokens stream back via server-sent events, so the user sees words appear one by one. The hard parts: a KV-cache that avoids recomputing attention over the whole prefix (O(N²) work) for every new token; continuous batching (vLLM/Orca) that keeps GPU utilization at ~80% instead of ~30%; tensor + pipeline parallelism to shard 70B+ parameter models across multiple GPUs; a safety pipeline (pre-classifier + post-classifier + RLHF alignment); and per-token billing that meters usage accurately at 1B tokens/day.
Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.
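For the "estimate numbers" part, a few lines of arithmetic are enough to get started. The sketch below takes 1B tokens/day and the 70B parameter count from the prompt. Everything else is an assumption that the prompt does not state: FP16 weights (2 bytes per parameter), the 80 GB A100 variant, and a Llama-2-70B-like shape (80 layers, 8 KV heads of dimension 128) for the KV-cache size.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tokens_per_day = 1_000_000_000      # from the prompt
params = 70_000_000_000             # from the prompt (70B model)

# Assumptions, not from the prompt:
bytes_per_param = 2                 # FP16/BF16 weights
gpu_memory_bytes = 80 * 10**9       # 80 GB A100 variant
n_layers, n_kv_heads, head_dim = 80, 8, 128   # Llama-2-70B-like shape
bytes_per_value = 2                 # FP16 KV-cache entries

avg_tokens_per_sec = tokens_per_day / SECONDS_PER_DAY
weight_bytes = params * bytes_per_param
# Ceiling division: GPUs needed just to hold the weights.
min_gpus_for_weights = -(-weight_bytes // gpu_memory_bytes)
# Factor 2 covers one key and one value vector per layer and KV head.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

print(f"average throughput: {avg_tokens_per_sec:,.0f} tokens/s")
print(f"weights: {weight_bytes / 10**9:.0f} GB -> at least {min_gpus_for_weights} GPUs")
print(f"KV-cache: {kv_bytes_per_token / 1024:.0f} KiB per token")
```

Under these assumptions the average load is roughly 11.6K tokens/s, the weights alone need 140 GB (two 80 GB GPUs as a floor), and each cached token costs 320 KiB. The average hides peak traffic, and the GPU floor leaves no room for the KV-cache or activations, so a real deployment would shard each replica more widely.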
02
Hints (progressive)
Hint 1
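Lead with KV-cache. "Without it, every new token recomputes attention over the whole prefix: O(N²) work per token. With it, the new token attends over cached keys and values: O(N) per token. This is the single most important optimization." Shows you understand transformer internals.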
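The effect of the KV-cache can be sketched with a toy single-head attention layer in plain Python. The 2-dimensional projection matrices and token vectors are hypothetical, chosen only for illustration. The uncached step mimics a full forward pass by recomputing keys, values, and causal attention for every position; the cached step projects only the new token. Both return the output for the last position plus the number of query-key scores they computed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(x, w):
    """Multiply vector x by matrix w (given as a list of rows)."""
    return [dot(row, x) for row in w]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

# Hypothetical projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]

def step_without_cache(tokens):
    """Full forward pass: recompute K, V, and causal attention for every position."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    outputs, scores = [], 0
    for i, t in enumerate(tokens):
        outputs.append(attend(project(t, W_Q), keys[:i + 1], values[:i + 1]))
        scores += i + 1                     # one score per visible position
    return outputs[-1], scores

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Project only the new token, append it to the cache, then attend."""
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values), len(self.keys)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
for n in range(1, len(tokens) + 1):
    slow_out, slow_scores = step_without_cache(tokens[:n])
    fast_out, fast_scores = cache.step(tokens[n - 1])
    assert all(math.isclose(a, b) for a, b in zip(slow_out, fast_out))
    print(f"step {n}: {slow_scores} scores without cache, {fast_scores} with cache")
```

At step n the uncached path computes n(n+1)/2 scores and the cached path computes n, so by step 4 it is 10 versus 4. The price is memory: the cache grows by one key and one value per token, per layer, per head.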
Hint 2
Name continuous batching. "vLLM-style continuous batching keeps GPU utilization at ~80% vs ~30% for static batching." This is the insight that separates candidates who have worked with inference systems from those who have not.
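A small simulation shows why refilling slots at every decode step helps. This is a toy model, not vLLM's or Orca's scheduler: the output lengths are hypothetical, every request is assumed to be queued at time zero, and prefill cost and memory limits are ignored. Utilization here means busy slot-steps divided by available slot-steps.

```python
from collections import deque

def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes. Returns (steps, utilization)."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps, sum(lengths) / (steps * batch_size)

def continuous_batching(lengths, batch_size):
    """Refill free slots at every decode step. Returns (steps, utilization)."""
    queue = deque(lengths)
    running = []                  # tokens still to generate per in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step: every running request emits a token; finished ones leave.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps, sum(lengths) / (steps * batch_size)

# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [2, 8, 3, 8, 1, 2, 8, 2]
for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
    steps, util = fn(lengths, batch_size=4)
    print(f"{name}: {steps} steps, {util:.0%} slot utilization")
```

With these lengths, static batching needs 16 steps and continuous batching 11, because short requests no longer hold a slot idle while the longest one finishes. The ~80% and ~30% figures quoted above are not derived from this toy; it only illustrates the mechanism.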
Hint 3
Explain the parallelism strategy. "Tensor parallelism within a node (NVLink), pipeline parallelism across nodes (InfiniBand). Combine both for 70B+ models." Concrete and correct.
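The tensor-parallel half of that strategy can be illustrated with a column-split linear layer. The sketch below is plain Python with made-up weights: each "GPU" holds a block of output columns, computes its slice independently, and concatenation stands in for the all-gather a real system would run over NVLink. Pipeline parallelism, which assigns whole layers to different stages, is not shown.

```python
def matmul_vec(x, w):
    """x: vector of length d_in; w: d_in x d_out matrix given as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, n_shards):
    """Give each shard a contiguous block of output columns."""
    d_out = len(w[0])
    if d_out % n_shards != 0:
        raise ValueError("toy sketch: output columns must divide evenly")
    width = d_out // n_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(n_shards)]

def column_parallel_linear(x, shards):
    """Every shard sees the full input and produces its own slice of the output."""
    partials = [matmul_vec(x, shard) for shard in shards]
    # Concatenation stands in for an all-gather across devices.
    return [value for part in partials for value in part]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul_vec(x, w)
sharded = column_parallel_linear(x, split_columns(w, n_shards=2))
print(full, sharded)
assert full == sharded
```

Each shard stores only half of the weight matrix, which is the point for a model whose weights do not fit on one GPU. Because shards exchange results inside every layer, this style of sharding is usually kept within a node where the interconnect is fastest.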
03
Rubric — 10 points
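+2 Lead with KV-cache. "Without it, every new token recomputes attention over the whole prefix: O(N²) work per token. With it, the new token attends over cached keys and values: O(N) per token. This is the single most important optimization." Shows you understand transformer internals.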
+2 Name continuous batching. "vLLM-style continuous batching keeps GPU utilization at ~80% vs ~30% for static batching." This is the insight that separates candidates who have worked with inference systems from those who have not.
+2 Explain the parallelism strategy. "Tensor parallelism within a node (NVLink), pipeline parallelism across nodes (InfiniBand). Combine both for 70B+ models." Concrete and correct.
+2 TTFT is the UX metric. "Time-to-first-token matters more than tokens-per-second for user perception. Prefill dominates TTFT." Shows product thinking.
+2 Safety is multi-layer. "Pre-filter rejects before GPU, post-filter catches mid-stream, RLHF aligns the model itself." Don't just say "add a filter" — describe the pipeline.
Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.
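For the TTFT point, a rough latency model helps make "prefill dominates TTFT" concrete. This is a simplified sketch: it assumes the first output token appears as soon as prefill finishes, and the throughput numbers in the example are hypothetical, not measurements.

```python
def latency_breakdown(prompt_tokens, output_tokens,
                      prefill_tokens_per_s, decode_tokens_per_s,
                      queue_wait_s=0.0):
    """Split request latency into time-to-first-token and total time."""
    # Prefill processes the whole prompt and yields the first output token.
    ttft_s = queue_wait_s + prompt_tokens / prefill_tokens_per_s
    # The remaining tokens are generated one per decode step.
    decode_s = (output_tokens - 1) / decode_tokens_per_s
    return {"ttft_s": ttft_s, "total_s": ttft_s + decode_s}

# Hypothetical request: 2,000-token prompt, 200-token answer.
result = latency_breakdown(prompt_tokens=2000, output_tokens=200,
                           prefill_tokens_per_s=10_000, decode_tokens_per_s=50)
print(f"TTFT: {result['ttft_s']:.2f} s, total: {result['total_s']:.2f} s")
```

In this example the user waits about 0.2 s before anything appears and about 4.2 s for the full answer. Doubling the prompt length or adding queue wait moves TTFT directly, while decode speed only affects how quickly the rest streams in.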
04
Red flags (things that tank the interview)
Static batching — wait for the longest request in the batch to finish before accepting new ones
No KV-cache — recompute all attention from scratch for every new token
Safety filter only after full generation — harmful content already generated and cached
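The alternative to the last red flag is to filter while streaming. The sketch below is a minimal illustration under loud assumptions: `is_unsafe` is a keyword placeholder for a trained classifier, `fake_model` is a hypothetical stand-in for GPU inference, and the marker strings are invented. RLHF alignment is a property of the model itself and is not represented here.

```python
BLOCKED_TERMS = {"forbidden"}       # hypothetical stand-in for a trained classifier

def is_unsafe(text):
    """Placeholder classifier: flags text containing any blocked term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def fake_model(prompt):
    """Hypothetical token stream standing in for GPU inference."""
    yield from ["This ", "is ", "for", "bidden ", "content"]

def safe_stream(prompt, model=fake_model, window_chars=64):
    # Pre-filter: reject before any GPU time is spent.
    if is_unsafe(prompt):
        yield "[request blocked]"
        return
    recent = ""
    for token in model(prompt):
        # Post-filter: check a rolling window so terms split across tokens are caught.
        recent = (recent + token)[-window_chars:]
        if is_unsafe(recent):
            yield "[response stopped]"
            return
        yield token

print(list(safe_stream("tell me something")))
print(list(safe_stream("say the forbidden thing")))
```

The first call streams "This ", "is ", "for" and then stops, since the blocked term only becomes visible once "bidden " arrives; the second is rejected before the model runs. Note that "for" has already reached the user by then, so a production design might hold back a short buffer of tokens before releasing them.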