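(a) KV-Cache. During autoregressive generation, each new token's attention layer needs the keys and values of ALL prior tokens. Without a cache, every decoding step reruns the forward pass over the whole prefix, so step N costs O(N²) in attention and generating N tokens costs O(N³) in total. The KV-cache stores the K and V tensors for every layer and every prior token in GPU HBM. Each new token only computes its own K/V, appends them to the cache, and attends over the cached values, which brings step N down to O(N) and the whole sequence to O(N²). Cost: roughly 0.3 MB of GPU memory per token for a 70B model with grouped-query attention (assuming a Llama-2-70B-style layout: 80 layers, 8 KV heads, head_dim 128, FP16), so a 100K-context request holds about 33 GB of KV-cache; without grouped-query attention the figure is about 8x larger. This is why long-context requests are expensive: they consume GPU memory, not just compute.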
# KV-cache pseudocode per attention layer
cache_k[layer].append(new_token_k)  # cache_k[layer] shape: [seq_len, head_dim]
cache_v[layer].append(new_token_v)
attn_output = softmax(new_token_q @ cache_k[layer].T / sqrt(head_dim)) @ cache_v[layer]
# Only the new token's q, k, v are computed fresh; K/V for earlier tokens are reused
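To make the saving concrete, here is a minimal single-head, single-layer sketch in plain Python. The `ToyHead` class and its sine/cosine "projections" are hypothetical stand-ins for learned weight matrices; the point is only that cached and uncached decoding give the same outputs while the cached path computes K/V once per token.

```python
import math

HEAD_DIM = 4

class ToyHead:
    """One attention head with fake, deterministic projections (illustrative only)."""
    def __init__(self):
        self.kv_projections = 0  # how many times K/V were computed for a token

    def q(self, tok):
        return [math.sin(tok * (i + 1) + 0.1) for i in range(HEAD_DIM)]

    def kv(self, tok):
        self.kv_projections += 1
        k = [math.sin(tok * (i + 1) + 0.2) for i in range(HEAD_DIM)]
        v = [math.cos(tok * (i + 1) + 0.3) for i in range(HEAD_DIM)]
        return k, v

def attend(q, keys, values):
    # softmax(q . K^T / sqrt(d)) . V for a single query vector
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(len(q))]

def decode_cached(head, tokens):
    cache_k, cache_v, outputs = [], [], []
    for tok in tokens:
        k, v = head.kv(tok)  # K/V for the new token only
        cache_k.append(k)
        cache_v.append(v)
        outputs.append(attend(head.q(tok), cache_k, cache_v))
    return outputs

def decode_uncached(head, tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        kvs = [head.kv(tok) for tok in tokens[:t]]  # recompute the whole prefix
        keys = [k for k, _ in kvs]
        values = [v for _, v in kvs]
        outputs.append(attend(head.q(tokens[t - 1]), keys, values))
    return outputs

tokens = [3, 1, 4, 1, 5, 9]
with_cache, without_cache = ToyHead(), ToyHead()
out_cached = decode_cached(with_cache, tokens)
out_uncached = decode_uncached(without_cache, tokens)
```

For six tokens the cached path performs 6 K/V projections and the uncached path performs 1 + 2 + ... + 6 = 21. In a real multi-layer model the gap is larger, because the uncached path also has to rerun attention for every prefix position in every layer.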
(b) Continuous Batching (Orca/vLLM). Static batching groups N requests and waits for ALL of them to finish before accepting new ones. Problem: at ~30 ms per token, a 10-token request finishes in 300 ms while a 2000-token request takes 60 seconds, so the short request's slot sits idle for 59.7 seconds. Continuous batching schedules at the granularity of a single decode step: new requests join the batch mid-flight, and finished requests leave immediately. GPU utilization can rise from ~30% (static) to ~80% (continuous). vLLM's PagedAttention further optimizes by managing KV-cache memory like OS virtual memory pages, which removes external fragmentation and limits waste to the unfilled tail of each request's last page.
PagedAttention detail: a traditional KV-cache allocates one contiguous block of GPU memory per request, sized for max_tokens. At ~0.3 MB of KV-cache per token (a 70B model with grouped-query attention in FP16), a 128K-context request reserves ~40 GB even if it only generates 50 tokens. Wasted memory = fewer concurrent requests. PagedAttention splits the KV-cache into fixed-size pages (e.g., 16 tokens each). Pages are allocated on demand and can be non-contiguous, exactly like virtual memory. A request that generates 50 tokens only uses 4 pages (64 token slots, ~20 MB), not the full ~40 GB reservation. This alone increases throughput by ~2-4x on long-context workloads.
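The on-demand paging described above can be sketched with a block table per request. This is a toy allocator, not vLLM's implementation: `PagedKVAllocator`, `append_token`, and `locate` are hypothetical names, and each "page" is just an integer id rather than a region of HBM.

```python
import math

PAGE_TOKENS = 16  # tokens per page, matching the example page size

class PagedKVAllocator:
    """Toy page allocator: hands out fixed-size pages on demand (illustrative only)."""
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of physical page ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        count = self.num_tokens.get(req_id, 0)
        if count == len(table) * PAGE_TOKENS:  # last page is full, or no page yet
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.num_tokens[req_id] = count + 1

    def locate(self, req_id, token_idx):
        # Translate a logical token index to (physical page, offset within page).
        page = self.block_tables[req_id][token_idx // PAGE_TOKENS]
        return page, token_idx % PAGE_TOKENS

    def free(self, req_id):
        self.free_pages.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)

alloc = PagedKVAllocator(num_pages=64)
for _ in range(50):
    alloc.append_token("req-a")
pages_used = len(alloc.block_tables["req-a"])
# Pages a contiguous max-length (128K-token) reservation would pin instead.
contiguous_pages = math.ceil(128 * 1024 / PAGE_TOKENS)
```

A 50-token request ends up holding 4 pages, while a contiguous reservation for 128K tokens would pin 8192 pages' worth of memory. The pages in a block table need not be adjacent; `locate` does the same logical-to-physical translation a page table does.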
# Continuous batching pseudocode (simplified)
while True:
    # Check for finished requests and remove them from the batch.
    # Iterate over a copy so removing items doesn't disturb the loop.
    for req in list(active_batch):
        if req.last_token == EOS or req.num_tokens >= req.max_tokens:
            active_batch.remove(req)
            send_done(req)
            free_kv_pages(req)
    # Fill empty slots with waiting requests
    while len(active_batch) < max_batch_size and queue.not_empty():
        new_req = queue.pop()
        allocate_kv_pages(new_req)
        prefill(new_req)  # encode prompt
        active_batch.add(new_req)
    # One decode step for all active requests in parallel
    next_tokens = decode_step(active_batch)  # one batched forward pass
    for req, token in zip(active_batch, next_tokens):
        stream_token(req, token)
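A small simulation shows why this scheduling matters. It counts decode steps only and ignores prefill cost; the request lengths and batch size are made-up illustration values, and both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # Finished requests leave after every step and waiting ones take their slot.
    queue = deque(lengths)
    active = []  # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1                                 # one decode step for the whole batch
        active = [r - 1 for r in active if r > 1]  # drop requests that just finished
    return steps

lengths = [10, 2000, 10, 10, 2000, 10, 10, 10]  # output tokens per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these inputs static batching needs 4020 decode steps and continuous batching needs 2030, because short requests no longer hold a slot hostage while a long neighbour finishes.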
(c) Tensor + Pipeline Parallelism. A 70B-parameter model doesn't fit on one A100: the card has 80 GB of HBM, and the model weights alone are ~140 GB in FP16. Tensor parallelism (TP): split each layer across N GPUs. Each GPU computes 1/N of each layer, then an all-reduce syncs the results. Low latency but high bandwidth requirement (NVLink). Pipeline parallelism (PP): split layers sequentially across GPUs. GPU 1 handles layers 1-20, GPU 2 handles 21-40, etc. Lower bandwidth needs but introduces pipeline bubbles. In practice: use TP=8 within a node (NVLink) and add PP across nodes (InfiniBand) only when the model outgrows a single node, e.g. TP=8 x PP=4 on 32 GPUs.
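The two splits can be illustrated on a toy scale. The sketch below simulates "GPUs" as loop iterations: the tensor-parallel function uses a row-wise weight split followed by an element-wise sum standing in for the all-reduce, and the pipeline function just assigns contiguous layer ranges. Function names are hypothetical, the 80-layer count is an assumption (typical for 70B-class models), and both helpers assume sizes divide evenly.

```python
def matvec(x, w):
    # x: [d_in], w: [d_in][d_out] -> [d_out]
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def tensor_parallel_matvec(x, w, num_gpus):
    # Row-parallel split: "GPU" g holds a contiguous slice of the rows of w
    # and the matching slice of x, and produces a full-size partial result.
    rows = len(w) // num_gpus
    partials = []
    for g in range(num_gpus):
        lo, hi = g * rows, (g + 1) * rows
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # All-reduce: sum the partial results element-wise so every GPU sees the total.
    return [sum(vals) for vals in zip(*partials)]

def pipeline_stages(num_layers, num_stages):
    # Pipeline parallelism: contiguous layer ranges per stage (1-indexed, inclusive).
    per_stage = num_layers // num_stages
    return [(s * per_stage + 1, (s + 1) * per_stage) for s in range(num_stages)]

x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
full = matvec(x, w)
sharded = tensor_parallel_matvec(x, w, num_gpus=2)
stages = pipeline_stages(num_layers=80, num_stages=4)
```

The sharded result matches the single-device result, which is the property TP relies on; the cost is that every layer needs a collective operation, hence the NVLink requirement. The stage list reproduces the 1-20, 21-40, ... assignment described above.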
Concrete math for a 70B model: 70 billion params x 2 bytes (FP16) = 140 GB just for weights. One A100 has 80 GB HBM, minus ~10 GB for KV-cache and activations = ~70 GB usable for weights. So you need at least TP=2 just to load the model, with no headroom left over. In practice, TP=8 across one DGX node (8 A100s connected via NVLink at 600 GB/s) gives each GPU ~17.5 GB of weights, leaving plenty of room for KV-cache. Even larger models (405B, ~810 GB in FP16) exceed a single node's 640 GB of HBM, so they must span nodes: for example, PP=4 across 4 nodes + TP=8 within each node = 32 GPUs total, or ~25 GB of weights per GPU. The all-reduce in TP adds ~0.5 ms per layer; pipeline bubbles in PP waste ~15-20% of compute, a share that shrinks as more micro-batches are kept in flight.
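The sizing arithmetic is easy to script for other model sizes. This is a back-of-the-envelope helper under the same assumptions as the paragraph above (FP16 weights, ~70 GB usable per A100, decimal GB); the function names are hypothetical, and it ignores KV-cache and activation growth.

```python
import math

def weight_gb(params_billion, bytes_per_param=2):
    # 1e9 params * bytes per param / 1e9 bytes per GB
    return params_billion * bytes_per_param

def min_gpus_for_weights(params_billion, usable_gb_per_gpu=70, bytes_per_param=2):
    # Fewest GPUs whose combined usable memory holds the weights.
    return math.ceil(weight_gb(params_billion, bytes_per_param) / usable_gb_per_gpu)

def per_gpu_weight_gb(params_billion, tp, pp=1, bytes_per_param=2):
    # Weights are divided evenly across tp * pp GPUs.
    return weight_gb(params_billion, bytes_per_param) / (tp * pp)

weights_70b = weight_gb(70)
min_gpus_70b = min_gpus_for_weights(70)
shard_70b_tp8 = per_gpu_weight_gb(70, tp=8)
weights_405b = weight_gb(405)
shard_405b = per_gpu_weight_gb(405, tp=8, pp=4)
```

This reproduces the figures in the text: 140 GB and a minimum of 2 GPUs for 70B, 17.5 GB per GPU at TP=8, and 810 GB total or about 25.3 GB per GPU for 405B on 32 GPUs.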
(d) Safety Pipeline. Three layers: (1) Pre-inference classifier: a lightweight model (~1B params) screens the prompt. Obviously harmful prompts are rejected before reaching the expensive 70B model. Latency: ~5 ms. Catches ~80% of harmful prompts at this stage. (2) Post-inference classifier: scans the generated output token by token for harmful content and can halt generation mid-stream. It runs asynchronously alongside generation, so it adds little or no latency to the stream. (3) RLHF alignment: the model itself is trained via reinforcement learning from human feedback to refuse harmful requests politely. This is the deepest defense: the model's weights encode the policy, so even novel attack patterns that bypass classifiers are often caught.
Why three layers and not just one? Each layer catches different threats. The pre-filter is fast but shallow — it catches "how to build a bomb" but misses subtle jailbreaks. The post-filter has more context (it sees the actual output) and catches harmful content the model generated despite RLHF. RLHF catches the long tail — novel prompts that neither classifier was trained on. Defense in depth: if any one layer fails, the others provide coverage. Regular red-teaming (adversarial testing) feeds new attack vectors back into classifier training data.
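The control flow of the two classifier stages can be sketched as a generator that wraps the model's token stream. Everything here is hypothetical: `BLOCKLIST` is a crude substring check standing in for trained classifiers, `[refused]` and `[halted]` are made-up sentinel values, and the post-filter runs synchronously before each token is emitted, whereas the pipeline described above runs it asynchronously.

```python
BLOCKLIST = ("build a bomb",)  # hypothetical stand-in for a trained classifier

def pre_filter(prompt):
    # Stage 1: cheap check on the prompt before the large model runs.
    return not any(term in prompt.lower() for term in BLOCKLIST)

def post_filter(text_so_far):
    # Stage 2: check the output generated so far (same toy rule for brevity).
    return not any(term in text_so_far.lower() for term in BLOCKLIST)

def generate_safely(prompt, model_tokens):
    """Yield tokens until the stream ends or a filter trips.

    `model_tokens` is any iterable of tokens from the (RLHF-aligned) model.
    """
    if not pre_filter(prompt):
        yield "[refused]"
        return
    emitted = ""
    for token in model_tokens:
        emitted += token
        if not post_filter(emitted):
            yield "[halted]"  # stop mid-stream; the offending token is never sent
            return
        yield token

ok = list(generate_safely("Say hi", ["Hello", " world"]))
refused = list(generate_safely("How to build a bomb", ["Step", " one"]))
halted = list(generate_safely("Tell me a story", ["First", " build", " a", " bomb"]))
```

The third call shows why the post-filter looks at the accumulated output rather than single tokens: no individual token is harmful, but the running text is.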
(e) Streaming. Server-sent events (SSE) stream each token as it's generated. The client receives events such as data: {"choices":[{"delta":{"content":"Hello"}}]} roughly 30 ms apart. Time-to-first-token (TTFT), the delay before the first token appears, is the critical UX metric. TTFT is dominated by the "prefill" phase: running the entire prompt through all layers to populate the KV-cache. For a 10K-token prompt on a 70B model, prefill takes ~500 ms. After that, each subsequent token takes ~30 ms (decode phase).
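As a quick worked example, end-to-end latency is TTFT plus one decode interval per additional token. The helper below is a hypothetical one-liner that uses the ~500 ms and ~30 ms figures from this section as defaults.

```python
def total_latency_ms(output_tokens, ttft_ms=500, per_token_ms=30):
    # The first token arrives after prefill (TTFT); each later token adds one decode step.
    return ttft_ms + (output_tokens - 1) * per_token_ms

first_token_ms = total_latency_ms(1)
long_answer_ms = total_latency_ms(1000)
```

A 1000-token answer takes about 30.5 seconds in total, yet the user sees the first token after 0.5 seconds, which is the case for streaming.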
SSE implementation: the API gateway holds a long-lived HTTP connection. The inference engine writes tokens to a ring buffer; a streamer goroutine reads from the buffer and flushes SSE frames to the client. Connection drops are handled gracefully — the client reconnects with a Last-Event-ID header and the server replays missed tokens from the buffer. For the client, rendering tokens progressively as they arrive creates the illusion of a fast response even when total generation takes 30+ seconds. Studies show users perceive streaming responses as ~5x faster than waiting for the complete response.
// SSE streaming format
HTTP/1.1 200 OK
Content-Type: text/event-stream

id: 1
data: {"choices":[{"delta":{"content":"Hello"}}]}

id: 2
data: {"choices":[{"delta":{"content":" world"}}]}

id: 3
data: [DONE]
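The buffer-and-replay behaviour can be sketched as follows. This is an in-memory illustration only: `TokenReplayBuffer` and `sse_frame` are hypothetical names, a bounded `deque` stands in for the ring buffer, and the HTTP layer is left out.

```python
import json
from collections import deque

def sse_frame(event_id, payload):
    # One SSE event: "id:" and "data:" fields terminated by a blank line.
    data = payload if isinstance(payload, str) else json.dumps(payload, separators=(",", ":"))
    return f"id: {event_id}\ndata: {data}\n\n"

class TokenReplayBuffer:
    """Bounded buffer of recent events so a reconnecting client can catch up."""
    def __init__(self, capacity=1024):
        self.events = deque(maxlen=capacity)  # oldest events fall off automatically
        self.next_id = 1

    def _push(self, payload):
        frame = sse_frame(self.next_id, payload)
        self.events.append((self.next_id, frame))
        self.next_id += 1
        return frame

    def publish(self, token):
        return self._push({"choices": [{"delta": {"content": token}}]})

    def finish(self):
        return self._push("[DONE]")

    def replay_after(self, last_event_id):
        # Frames the client missed, given its Last-Event-ID header value.
        return [frame for eid, frame in self.events if eid > last_event_id]

buf = TokenReplayBuffer()
first = buf.publish("Hello")
buf.publish(" world")
buf.finish()
missed = buf.replay_after(1)  # client reconnects having seen only event 1
```

A client that reconnects with `Last-Event-ID: 1` receives events 2 and 3. If the buffer capacity is smaller than the gap, the oldest events are gone and the server would need another recovery path; this sketch does not handle that case.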
Inference Request Lifecycle (Mermaid)
sequenceDiagram
participant U as User
participant GW as API Gateway
participant SF as Safety Pre-filter
participant TK as Tokenizer
participant GPU as GPU Cluster
participant PF as Safety Post-filter
participant SSE as SSE Streamer
U->>GW: POST /chat/completions (stream:true)
GW->>SF: classify prompt
SF-->>GW: safe
GW->>TK: tokenize prompt
TK->>GPU: prefill (encode all tokens)
GPU-->>SSE: token 1 (TTFT ~500ms)
GPU-->>PF: token 1 safety check
PF-->>SSE: pass
SSE-->>U: data: token 1
GPU-->>SSE: token 2 (~30ms later)
SSE-->>U: data: token 2
Note over GPU,SSE: continues until EOS or max_tokens
SSE-->>U: data: [DONE]
Interview answer
"User sends a prompt to the API gateway. After safety pre-screening, it's tokenized and sent to a GPU cluster running the 70B model with tensor parallelism (8-way within node) and pipeline parallelism (across nodes). The KV-cache avoids recomputing K/V for the whole prefix at every step: each new token only computes its own K/V and attends over the cached K/V from prior tokens. Continuous batching (vLLM-style) keeps GPU utilization at ~80% by letting requests join/leave the batch dynamically. Tokens stream back via SSE; TTFT under 1 second. Post-inference safety classifier filters output mid-stream. Billing meters input + output tokens via Kafka."