ChatGPT

User sends a prompt. API gateway routes it through a tokenizer to a GPU cluster of ~10K A100s running autoregressive inference. Tokens stream back via server-sent events — the user sees words appearing one by one. The hard parts: a KV-cache that avoids O(N²) recomputation on every token, continuous batching (vLLM/Orca) that keeps GPU utilization at ~80% instead of ~30%, tensor + pipeline parallelism to shard 70B+ parameter models across multiple GPUs, a safety pipeline (pre-classifier + post-classifier + RLHF alignment), and per-token billing that meters usage accurately at 1B tokens/day.

Core: KV-Cache + Continuous Batching + Tensor Parallelism · ~100M users · ~1B tokens/day · ~10K A100 GPUs · Streaming SSE
02

Requirements

Functional
  • User sends a prompt (text + optional images); receives a streamed completion
  • Multi-turn conversation: full chat history sent as context on each request
  • Streaming response via server-sent events — tokens appear word-by-word
  • Safety pipeline: pre-inference classifier rejects harmful prompts; post-inference classifier filters toxic output
  • Per-token billing: meter input tokens + output tokens; charge per 1K tokens
  • Conversation storage: users can view, continue, and delete past chats
Non-Functional
  • Time-to-first-token < 1 sec — the key UX metric for perceived speed
  • Throughput: ~10K concurrent inference requests across the GPU cluster
  • GPU utilization > 70% — GPUs cost $2/hr each; idle = burning money
  • Support context windows up to 128K tokens per request
  • 99.9% availability — degraded mode (shorter context, slower model) over downtime
  • Safety: < 0.1% harmful output rate; multi-layer defense
03

Scale Estimation

  • Registered users: ~100M (DAU ~10M; concurrent sessions ~500K at peak)
  • Tokens per day: ~1B (input + output combined; ~60% input, ~40% output)
  • GPU cluster: ~10K A100s (80 GB HBM each; NVLink + InfiniBand interconnect)
  • Concurrent inferences: ~10K (requests actively generating tokens at any moment)
  • KV-cache per request: ~1.5 GB (for 100K-context on a 70B model; scales linearly with context)
  • Time-to-first-token: < 1 sec (prefill, i.e. encoding the prompt, dominates; decode is ~30 ms/token)
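The estimates above can be turned into a quick back-of-envelope calculation. The sketch below only uses figures quoted in this document (1B tokens/day, a 60/40 input/output split, 10K GPUs at $2/hr); the function and key names are illustrative, not part of any real tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def scale_estimates(tokens_per_day=1_000_000_000, input_share=0.6,
                    gpus=10_000, gpu_cost_per_hour=2.0, utilization=0.8):
    """Derive average rates and cluster cost from the headline numbers."""
    cluster_cost_per_hour = gpus * gpu_cost_per_hour
    return {
        # Average over the whole day; peak traffic will be higher.
        "avg_tokens_per_sec": tokens_per_day / SECONDS_PER_DAY,
        "input_tokens_per_day": tokens_per_day * input_share,
        "output_tokens_per_day": tokens_per_day * (1 - input_share),
        "cluster_cost_per_hour_usd": cluster_cost_per_hour,
        # Money spent on GPU time that does no useful work.
        "idle_cost_per_hour_usd": cluster_cost_per_hour * (1 - utilization),
    }

if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.0f}")
```

Calling `scale_estimates(utilization=0.3)` shows why utilization is treated as a first-class requirement: most of the hourly cluster bill is then spent on idle GPUs.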
04

API Design

POST /api/chat/completions

Send a prompt. Body: {model, messages: [{role, content}], max_tokens, temperature, stream: true}. Returns SSE stream: data: {"choices":[{"delta":{"content":"Hello"}}]}.

GET /api/conversations/{conv_id}

Retrieve conversation history. Returns {id, messages: [{role, content, created_at}], model, total_tokens}.

DELETE /api/conversations/{conv_id}

Delete a conversation. Soft-delete; purged from storage after 30 days.

GET /api/usage

Token usage and billing. Returns {period, input_tokens, output_tokens, cost_usd}. Metered per-request via billing pipeline.

GET /api/models

List available models with context window sizes, pricing, and capabilities.
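To make the streaming contract of the chat completions endpoint concrete, here is a minimal client-side sketch that extracts the text deltas from an SSE stream. It assumes the lines have already been read and decoded from the HTTP response; `iter_sse_deltas` is a hypothetical helper name, and the payload shape is the one quoted in the endpoint description above.

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, "id:" fields, and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content

if __name__ == "__main__":
    stream = [
        'data: {"choices":[{"delta":{"content":"Hello"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":" world"}}]}',
        "",
        "data: [DONE]",
    ]
    print("".join(iter_sse_deltas(stream)))
```

A UI would append each yielded delta to the visible message as it arrives rather than joining them at the end.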

05

Architecture

Four tiers: API tier (gateway, auth, rate-limit, routing), Inference tier (GPU cluster with tensor/pipeline parallelism), Safety tier (pre + post classifiers), Storage tier (conversations, billing, model weights). Streaming SSE connects API tier directly to client.

Figure: ChatGPT inference architecture. Components: User (100M; browser / API client); API Gateway (auth + rate-limit); Safety Pre-filter (reject harmful prompts); Tokenizer + Router (encode prompt, select GPU shard); GPU Inference (10K A100s, TP + PP, KV-cache + continuous batching); Safety Post-filter (filter toxic output); SSE Streamer (token-by-token to client); Billing Pipeline (meter per token); Conversation Store (Postgres + Redis cache: chat history, user prefs, model configs); Model Weight Store (S3 → GPU HBM on load); Kafka (token events → billing aggregation + usage analytics + abuse detection).
Request flow: User (sends prompt) → API Gateway (auth + rate-limit) → Safety Pre-filter (reject harmful) → Tokenizer (encode prompt) → GPU Cluster (KV-cache + continuous batching) → Safety Post-filter (filter output) → SSE Streamer (token-by-token)
06

Deep Dive — KV-Cache, Continuous Batching & Parallelism

(a) KV-Cache. During autoregressive generation, each new token's attention layers need the keys and values of ALL prior tokens. Without a cache, generating token N means re-running the model over all N-1 previous tokens to rebuild their K and V, which adds up to O(N²) token-passes for a sequence of length N. The KV-cache stores the K and V tensors for every layer and every prior token in GPU HBM. Each new token computes only its own K/V, appends them to the cache, and attends over the cached values, so the total drops to O(N) token-passes (each step still reads the whole cache, but recomputes nothing). Cost: ~1.5 GB of GPU memory per 100K-context request on a 70B model in this design's estimate; the exact figure depends on layer count, number of KV heads, and precision. This is why long-context requests are expensive: they consume GPU memory, not just compute.

// KV-cache pseudocode per attention layer (single head shown)
cache_k[layer].append(new_token_k)  // cache_k[layer] shape: [seq_len, head_dim]
cache_v[layer].append(new_token_v)
scores = new_token_q @ cache_k[layer].T / sqrt(head_dim)  // shape: [seq_len]
attn_output = softmax(scores) @ cache_v[layer]            // shape: [head_dim]
// Only new_token_q, new_token_k, new_token_v are computed fresh; older K/V are reused
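A small runnable version of the same idea may help. The sketch below is a toy single-head, single-layer attention in pure Python (no GPU, no batching; all function names are illustrative). It decodes a sequence twice, once with a cache and once recomputing K/V from scratch at every step, and counts how many K/V projections each approach performs.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_with_cache(xs, Wq, Wk, Wv):
    cache_k, cache_v, outputs, kv_projections = [], [], [], 0
    for x in xs:
        cache_k.append(matvec(Wk, x))  # only the new token is projected
        cache_v.append(matvec(Wv, x))
        kv_projections += 1
        outputs.append(attend(matvec(Wq, x), cache_k, cache_v))
    return outputs, kv_projections

def decode_without_cache(xs, Wq, Wk, Wv):
    outputs, kv_projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(Wk, x) for x in xs[:t]]    # rebuilt every step
        values = [matvec(Wv, x) for x in xs[:t]]
        kv_projections += t
        outputs.append(attend(matvec(Wq, xs[t - 1]), keys, values))
    return outputs, kv_projections
```

Both functions return the same attention outputs; the difference is that the cached version performs N projections for N tokens while the uncached version performs N(N+1)/2.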

(b) Continuous Batching (Orca/vLLM). Static batching: group N requests, wait for ALL to finish before accepting new ones. Problem: a 10-token request finishes in 300 ms; a 2000-token request takes 60 seconds. The GPU idles on the short request's slot for 59.7 seconds. Continuous batching: new requests join the batch mid-flight, and finished requests leave immediately. GPU utilization jumps from ~30% (static) to ~80% (continuous). vLLM's PagedAttention further optimizes by managing KV-cache memory like OS virtual memory pages, which nearly eliminates fragmentation and wasted HBM.

PagedAttention detail: a traditional KV-cache allocates one contiguous block of GPU memory per request, sized for the maximum sequence length. A 128K-context request reserves ~2 GB even if its sequence only ever reaches 50 tokens. Wasted memory = fewer concurrent requests. PagedAttention splits the KV-cache into fixed-size pages (e.g., 16 tokens each). Pages are allocated on demand and can be non-contiguous, exactly like virtual memory. A sequence of 50 tokens then occupies only 4 pages (ceil(50 / 16)), a tiny fraction of the full 2 GB reservation, and the only waste is the unused tail of the last page. This alone increases throughput by ~2-4x on long-context workloads.
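The on-demand paging described above can be sketched as a tiny allocator. This is an illustrative model of the bookkeeping only, not vLLM's actual implementation: `PagedKVAllocator` and its methods are hypothetical names, and pages are plain integers rather than GPU memory blocks.

```python
import math

class PagedKVAllocator:
    """Toy block-table allocator: pages are handed out on demand."""

    def __init__(self, total_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(total_pages))
        self.block_tables = {}  # request id -> list of physical page ids
        self.lengths = {}       # request id -> number of tokens stored

    def pages_needed(self, num_tokens):
        return math.ceil(num_tokens / self.page_size)

    def can_admit(self, prompt_tokens):
        """Admission control: is there room for this prompt right now?"""
        return self.pages_needed(prompt_tokens) <= len(self.free_pages)

    def admit(self, req_id, prompt_tokens):
        if not self.can_admit(prompt_tokens):
            return False
        count = self.pages_needed(prompt_tokens)
        self.block_tables[req_id] = [self.free_pages.pop() for _ in range(count)]
        self.lengths[req_id] = prompt_tokens
        return True

    def append_token(self, req_id):
        """Record one generated token; allocate a page only if the last is full."""
        if self.lengths[req_id] % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free KV pages; caller should preempt or queue")
            self.block_tables[req_id].append(self.free_pages.pop())
        self.lengths[req_id] += 1

    def free(self, req_id):
        self.free_pages.extend(self.block_tables.pop(req_id))
        del self.lengths[req_id]
```

The same `can_admit` check doubles as a simple form of admission control: a request is queued rather than started when its prompt would not fit in the pages that are currently free.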

# Continuous batching pseudocode (simplified)
while True:
    # Check for finished requests; iterate over a copy so removal is safe
    for req in list(active_batch):
        if req.last_token == EOS or req.num_tokens >= req.max_tokens:
            active_batch.remove(req)
            send_done(req)
            free_kv_pages(req)

    # Fill empty slots with waiting requests
    while len(active_batch) < max_batch_size and queue.not_empty():
        new_req = queue.pop()
        allocate_kv_pages(new_req)
        prefill(new_req)               # encode prompt
        active_batch.append(new_req)

    # One decode step for all active requests in parallel
    next_tokens = decode_step(active_batch)  # one batched forward pass
    for req, token in zip(active_batch, next_tokens):
        req.last_token = token         # record state for the finish check above
        req.num_tokens += 1
        stream_token(req, token)
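To see the effect of the scheduling loop in numbers, the following sketch simulates both policies on a list of output lengths. It is a deliberately simplified model: one decode step produces one token for every occupied slot, prefill cost is ignored, and every length is assumed to be at least 1. The function names are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        active = [left - 1 for left in active if left - 1 > 0]
    return steps

def utilization(lengths, steps, batch_size):
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (steps * batch_size)

if __name__ == "__main__":
    lengths = [2000, 10, 10, 10, 2000, 10, 10, 10]
    for name, fn in [("static", static_batching_steps),
                     ("continuous", continuous_batching_steps)]:
        steps = fn(lengths, 4)
        print(name, steps, round(utilization(lengths, steps, 4), 2))
```

In this toy workload the continuous policy finishes in roughly half the decode steps, because the second long request starts as soon as a short one frees its slot instead of waiting for the first batch to drain.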

(c) Tensor + Pipeline Parallelism. A 70B-parameter model doesn't fit on one A100 (80 GB HBM; model weights alone are ~140 GB in FP16). Tensor parallelism (TP): split each layer across N GPUs. Each GPU computes 1/N of each layer, then an all-reduce syncs the partial results. Low latency but high bandwidth requirement (NVLink). Pipeline parallelism (PP): split layers sequentially across GPUs. GPU 1 handles layers 1-20, GPU 2 handles 21-40, etc. Lower bandwidth needs but introduces pipeline bubbles. In practice: use TP=8 within a node (NVLink) for a 70B model, and add PP across nodes (InfiniBand) once the model no longer fits in one node, e.g. TP=8 x PP=4 on 32 GPUs.

Concrete math for a 70B model: 70 billion params x 2 bytes (FP16) = 140 GB just for weights. One A100 has 80 GB HBM, minus ~10 GB for KV-cache and activations = ~70 GB usable for weights. So you need at least TP=2 just to load the model, with almost no headroom left. In practice, TP=8 across one DGX node (8 A100s connected via NVLink at 600 GB/s) gives each GPU ~17.5 GB of weights, leaving plenty of room for KV-cache. For even larger models (405B), the 810 GB of FP16 weights exceed a single node's 640 GB of HBM, so you add PP across nodes: PP=4 across 4 nodes + TP=8 within each node = 32 GPUs total, about 25 GB of weights per GPU. The all-reduce in TP adds ~0.5 ms per layer; pipeline bubbles in PP waste ~15-20% of compute but shrink as more micro-batches are kept in flight.
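The arithmetic above is easy to reproduce. The helpers below are a minimal sketch using the document's own assumptions (FP16 at 2 bytes per parameter, ~70 GB of each 80 GB A100 usable for weights); the function names are illustrative.

```python
import math

def weight_gb(params_billion, bytes_per_param=2):
    """Billions of parameters x bytes per parameter = gigabytes of weights."""
    return params_billion * bytes_per_param

def min_gpus_for_weights(params_billion, usable_gb_per_gpu=70, bytes_per_param=2):
    """Smallest GPU count whose combined usable memory holds the weights."""
    return math.ceil(weight_gb(params_billion, bytes_per_param) / usable_gb_per_gpu)

def weights_per_gpu_gb(params_billion, tp, pp=1, bytes_per_param=2):
    """Weights held by each GPU when sharded over tp x pp devices."""
    return weight_gb(params_billion, bytes_per_param) / (tp * pp)

if __name__ == "__main__":
    print(weight_gb(70), min_gpus_for_weights(70), weights_per_gpu_gb(70, tp=8))
    print(weight_gb(405), min_gpus_for_weights(405), weights_per_gpu_gb(405, tp=8, pp=4))
```

For the 405B case the minimum comes out above the 8 GPUs of a single node, which is the point at which pipeline parallelism across nodes becomes necessary.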

(d) Safety Pipeline. Three layers: (1) Pre-inference classifier — a lightweight model (~1B params) screens the prompt. Obviously harmful prompts rejected before reaching the expensive 70B model. Latency: ~5 ms. Catches ~80% of harmful prompts at this stage. (2) Post-inference classifier — scans generated output token-by-token for harmful content; can halt generation mid-stream. Runs asynchronously alongside generation so it doesn't add latency. (3) RLHF alignment — the model itself is trained via reinforcement learning from human feedback to refuse harmful requests politely. This is the deepest defense: the model's weights encode the policy, so even novel attack patterns that bypass classifiers are often caught.

Why three layers and not just one? Each layer catches different threats. The pre-filter is fast but shallow — it catches "how to build a bomb" but misses subtle jailbreaks. The post-filter has more context (it sees the actual output) and catches harmful content the model generated despite RLHF. RLHF catches the long tail — novel prompts that neither classifier was trained on. Defense in depth: if any one layer fails, the others provide coverage. Regular red-teaming (adversarial testing) feeds new attack vectors back into classifier training data.
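The mid-stream halt performed by the post-inference classifier can be sketched as a generator that wraps the token stream. This is a simplified, synchronous illustration: `is_harmful` stands in for a real classifier and is purely hypothetical, as are the window size and the replacement text, and the asynchronous execution described above is not modeled.

```python
def moderated_stream(tokens, is_harmful, window=8):
    """Yield tokens until the classifier flags the most recent window of text."""
    recent = []
    for token in tokens:
        recent.append(token)
        recent = recent[-window:]  # keep a rolling context for the classifier
        if is_harmful("".join(recent)):
            yield "[content removed]"
            return  # halt generation mid-stream
        yield token

if __name__ == "__main__":
    toy_classifier = lambda text: "forbidden" in text  # stand-in for a real model
    stream = ["This", " is", " forbidden", " text"]
    print(list(moderated_stream(stream, toy_classifier)))
```

Checking a rolling window rather than single tokens matters because harmful content usually spans several tokens; the window size trades classifier cost against how much context it sees.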

(e) Streaming. Server-sent events (SSE) stream each token as it's generated. The client receives data: {"choices":[{"delta":{"content":"Hello"}}]} events ~30 ms apart. Time-to-first-token (TTFT), the delay before the first token appears, is the critical UX metric. TTFT is dominated by the "prefill" phase: encoding the entire prompt through all layers. For a 10K-token prompt on a 70B model, prefill takes ~500 ms. After that, each subsequent token takes ~30 ms (decode phase).

SSE implementation: the API gateway holds a long-lived HTTP connection. The inference engine writes tokens to a ring buffer; a streamer goroutine reads from the buffer and flushes SSE frames to the client. Connection drops are handled gracefully: the client reconnects with a Last-Event-ID header and the server replays missed tokens from the buffer. For the client, rendering tokens progressively as they arrive creates the illusion of a fast response even when total generation takes 30+ seconds. Users generally perceive a streamed response as much faster than waiting for the complete response.

// SSE streaming format
HTTP/1.1 200 OK
Content-Type: text/event-stream

id: 1
data: {"choices":[{"delta":{"content":"Hello"}}]}

id: 2
data: {"choices":[{"delta":{"content":" world"}}]}

id: 3
data: [DONE]
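The frames above, together with the Last-Event-ID replay described earlier, can be produced by a small bounded buffer. The sketch below is an assumption-laden illustration in Python (the text mentions a goroutine, so a production streamer would likely be written in Go); `TokenReplayBuffer` and `format_sse` are hypothetical names.

```python
import json
from collections import deque

def format_sse(event_id, data):
    """Serialize one SSE frame; a blank line terminates the event."""
    return f"id: {event_id}\ndata: {data}\n\n"

class TokenReplayBuffer:
    """Bounded buffer of sent events so a reconnecting client can catch up."""

    def __init__(self, capacity=1024):
        self.events = deque(maxlen=capacity)  # oldest events are dropped first
        self.next_id = 1

    def _append(self, data):
        event_id = self.next_id
        self.next_id += 1
        self.events.append((event_id, data))
        return format_sse(event_id, data)

    def append_token(self, token):
        payload = {"choices": [{"delta": {"content": token}}]}
        return self._append(json.dumps(payload, separators=(",", ":")))

    def append_done(self):
        return self._append("[DONE]")

    def replay_after(self, last_event_id):
        """Frames the client missed, given its Last-Event-ID header value."""
        return [format_sse(i, d) for i, d in self.events if i > last_event_id]
```

If the client reconnects after the buffer has wrapped past its last seen id, the missing events are gone; a real service would need a policy for that case, such as restarting the response.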
Inference Request Lifecycle (Mermaid)
sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant SF as Safety Pre-filter
    participant TK as Tokenizer
    participant GPU as GPU Cluster
    participant PF as Safety Post-filter
    participant SSE as SSE Streamer
    U->>GW: POST /chat/completions (stream:true)
    GW->>SF: classify prompt
    SF-->>GW: safe
    GW->>TK: tokenize prompt
    TK->>GPU: prefill (encode all tokens)
    GPU-->>SSE: token 1 (TTFT ~500ms)
    GPU-->>PF: token 1 safety check
    PF-->>SSE: pass
    SSE-->>U: data: token 1
    GPU-->>SSE: token 2 (~30ms later)
    SSE-->>U: data: token 2
    Note over GPU,SSE: continues until EOS or max_tokens
    SSE-->>U: data: [DONE]
Interview answer

"User sends a prompt to the API gateway. After safety pre-screening, it's tokenized and sent to a GPU cluster running the 70B model with tensor parallelism (8-way within node) and pipeline parallelism (across nodes). KV-cache avoids O(N squared) recomputation — each new token only computes its own attention against cached K/V from prior tokens. Continuous batching (vLLM-style) keeps GPU utilization at ~80% by letting requests join/leave the batch dynamically. Tokens stream back via SSE; TTFT under 1 second. Post-inference safety classifier filters output mid-stream. Billing meters input + output tokens via Kafka."

08

Anti-patterns

🚫
Static batching — wait for the longest request in the batch to finish before accepting new ones

A 10-token request finishes in 300 ms but the GPU slot sits idle for 59.7 seconds waiting for the 2000-token request in the same batch. GPU utilization drops to ~30%. You're paying $20K/hr for a 10K-GPU cluster at 30% utilization.

Better: Continuous batching. Finished requests leave immediately; new requests join mid-batch. GPU utilization ~80%.
🚫
No KV-cache — recompute all attention from scratch for every new token

Generating a 1000-token response requires 1000 forward passes. Without a cache, pass N re-runs the model over all N-1 prior tokens to rebuild their K/V. Total: O(N²) compute. A response that takes 30 seconds with cache takes 50+ minutes without it.

Better: KV-cache stores prior K/V tensors in GPU HBM. Each new token computes only its own K/V and attends over the cached ones. O(N) token-passes total.
🚫
Safety filter only after full generation — harmful content already generated and cached

The model generates a complete harmful response, stores it in the conversation cache, then the filter catches it. The harmful content existed in memory and may have been logged. Wasted GPU compute on content that gets thrown away.

Better: Pre-inference classifier rejects harmful prompts before touching the expensive GPU. Post-inference filter runs token-by-token mid-stream and can halt early.
09

Tradeoffs & Design Choices

  • KV-cache memory vs compute. Caching K/V tensors uses ~1.5 GB per 100K-context request. On an 80 GB A100, this limits concurrent requests per GPU. But the alternative (recompute) is 100x slower. The tradeoff: serve fewer concurrent long-context requests, or more concurrent short-context requests.
  • Tensor parallelism vs pipeline parallelism. TP: lower latency (all GPUs work on each token), but requires high-bandwidth NVLink (~600 GB/s). PP: works over slower InfiniBand (~200 Gb/s per link, roughly 25 GB/s), but pipeline bubbles waste ~20% of compute. Combine both: TP within a node, PP across nodes.
  • Streaming vs batch response. Streaming (SSE) reduces perceived latency dramatically because the user starts reading immediately. But it complicates safety filtering (must check mid-stream) and billing (tokens counted incrementally). Non-streaming is simpler but feels much slower to the user.
  • Model size vs latency. Larger models (70B+) produce higher-quality output but have higher TTFT and lower throughput. Smaller models (7B) are 10x faster but less capable. Solution: route simple queries to small models, complex queries to large models. Cost optimization.
  • Safety strictness vs utility. Overly aggressive pre-filters reject legitimate queries (false positives). Under-filtering allows harmful content. The balance: high-precision pre-filter (reject only the obviously harmful), nuanced post-filter (context-aware), and RLHF alignment (model learns appropriate refusals).
  • Prefix caching vs fresh computation. Many requests share the same system prompt (e.g., "You are a helpful assistant..."). Caching the KV-cache for common prefixes saves ~200 ms of prefill per request. But it consumes persistent GPU memory and requires eviction logic. Worth it for system prompts used by millions of requests/day.
  • Speculative decoding tradeoff. A small draft model proposes N tokens; the large model verifies in one forward pass. If the draft is accurate, you get N tokens for the cost of 1 large-model step — 2-3x throughput gain. But if the draft is bad (complex reasoning), rejection rate is high and you've wasted the draft model's compute. Best for predictable text (code completion, boilerplate).
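The prefix-caching tradeoff above (saved prefill versus memory held and eviction logic) can be illustrated with a small LRU cache keyed by a hash of the prefix tokens. This is a sketch under stated assumptions: `PrefixKVCache` is a hypothetical name, `prefill_fn` stands in for the expensive prompt-encoding step, and the cached value is an opaque handle rather than real GPU tensors.

```python
import hashlib
from collections import OrderedDict

class PrefixKVCache:
    """LRU cache mapping a token prefix to its precomputed KV handle."""

    def __init__(self, max_entries=2):
        self.entries = OrderedDict()  # prefix hash -> opaque KV handle
        self.max_entries = max_entries

    @staticmethod
    def key(token_ids):
        joined = ",".join(map(str, token_ids))
        return hashlib.sha256(joined.encode()).hexdigest()

    def get_or_prefill(self, prefix_token_ids, prefill_fn):
        """Return (kv_handle, was_hit); run prefill_fn only on a miss."""
        k = self.key(prefix_token_ids)
        if k in self.entries:
            self.entries.move_to_end(k)  # mark as most recently used
            return self.entries[k], True
        kv = prefill_fn(prefix_token_ids)
        self.entries[k] = kv
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used
        return kv, False
```

A popular system prompt stays resident because every hit refreshes its position, while rarely used prefixes are the first to be evicted when memory is tight.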
10

Failure Modes

💥
GPU node failure mid-inference
One of 8 GPUs in a tensor-parallel group dies. The entire request fails because all 8 GPUs must synchronize.
→ Mitigation: request-level retry on a different GPU group. Checkpoint prefill state so retry skips prompt encoding. Spare GPU pools for fast failover.
🔥
KV-cache OOM — too many long-context requests
50 concurrent 128K-context requests land on one GPU; each needs ~2 GB of KV-cache. 100 GB total exceeds 80 GB HBM. OOM crash.
→ Mitigation: admission control per GPU based on available HBM. Queue long-context requests. vLLM can preempt requests and swap their KV pages to CPU memory.
🔄
Prompt injection bypasses safety filter
Adversarial prompt tricks the pre-classifier into rating it safe. Model generates harmful content.
→ Mitigation: defense in depth — pre-filter + post-filter + RLHF. Post-filter catches what pre-filter misses. Regular red-teaming to update classifiers.
Thundering herd after outage recovery
GPU cluster comes back online after 10-min outage. 500K queued requests hit simultaneously. Cluster overloads again.
→ Mitigation: rate-limited drain of queued requests. Priority tiers (paid users first). Exponential backoff on client retries.
💸
Billing pipeline lag — tokens generated but not metered
Kafka consumer falls behind. Users generate tokens that aren't billed. Revenue leakage.
→ Mitigation: synchronous token count in the response path (best-effort billing inline). Kafka for authoritative reconciliation. Alert on consumer lag > 5 min.
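The billing mitigation above (inline counts in the response path, with the event log as the authoritative record) implies a reconciliation step. The sketch below shows one possible shape for it; the function names, the data layout, and the per-1K prices are all hypothetical.

```python
from collections import defaultdict

def cost_usd(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Per-token billing: charge separately for input and output tokens."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

def reconcile(inline_counts, token_events):
    """Return {request_id: missing_tokens} where the two sources disagree.

    inline_counts: {request_id: tokens counted in the response path}
    token_events:  iterable of (request_id, tokens) consumed from the event log
    """
    metered = defaultdict(int)
    for request_id, tokens in token_events:
        metered[request_id] += tokens
    return {
        request_id: inline - metered.get(request_id, 0)
        for request_id, inline in inline_counts.items()
        if inline != metered.get(request_id, 0)
    }
```

A positive value means tokens were generated but have not yet been metered, which is the revenue-leakage case; the same report can drive the consumer-lag alert.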
11

Interview Tips

  1. Lead with KV-cache. "Without it, generation is O(N squared). With it, O(N). This is the single most important optimization." Shows you understand transformer internals.
  2. Name continuous batching. "vLLM-style continuous batching keeps GPU utilization at 80% vs 30% for static batching." This is the insight that separates candidates who've worked with inference systems from those who haven't.
  3. Explain the parallelism strategy. "Tensor parallelism within a node (NVLink), pipeline parallelism across nodes (InfiniBand). Combine both for 70B+ models." Concrete and correct.
  4. TTFT is the UX metric. "Time-to-first-token matters more than tokens-per-second for user perception. Prefill dominates TTFT." Shows product thinking.
  5. Safety is multi-layer. "Pre-filter rejects before GPU, post-filter catches mid-stream, RLHF aligns the model itself." Don't just say "add a filter" — describe the pipeline.
13

Evolution

1

MVP — Single GPU, one request at a time

7B model on one A100. No batching. No KV-cache optimization. Handles ~1 request/sec. Good enough for a demo or internal prototype. Total cost: $2/hr for one GPU.

2

Static batching + KV-cache

Batch 8 requests together. KV-cache avoids O(N²) recomputation: each new token reuses cached K/V. Throughput ~10 req/sec. GPU utilization only ~30% because short requests waste slots waiting for long ones.

3

Continuous batching + tensor parallelism

vLLM-style continuous batching with PagedAttention. 70B model across 8 GPUs with tensor parallelism (NVLink). GPU utilization jumps to ~80%. Throughput ~100 req/sec per node. First production-grade setup.

4

Multi-node cluster + pipeline parallelism + safety + streaming

10K GPUs across hundreds of nodes. TP=8 within each DGX node, PP across nodes via InfiniBand. SSE streaming for low perceived latency. Three-layer safety pipeline (pre-filter, post-filter, RLHF). Per-token billing via Kafka. ~10K concurrent inference requests.

5

Model routing + speculative decoding + multimodal

Route simple queries ("what's 2+2?") to a fast 7B model; complex queries to the full 70B. Speculative decoding: a small draft model proposes N tokens and the large model verifies them in one forward pass, giving roughly 2x faster decoding when most drafts are accepted. Vision encoder for image inputs, audio encoder for voice. Prefix caching for system prompts shared across requests.
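The speculative decoding step mentioned above can be sketched for the greedy case. This is a simplified illustration, not a production algorithm: `draft_next` and `target_next` are hypothetical callables that return the next token for a context, verification is written as a loop (a real system scores all draft positions in one batched forward pass), and the sampling-based acceptance rule used with non-greedy decoding is not shown.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """One greedy speculative-decoding step; returns the tokens accepted."""
    # 1. The draft model proposes num_draft tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(num_draft):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(prefix)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(ctx))
    return accepted
```

The output is always what the target model alone would have produced; a good draft yields up to `num_draft + 1` tokens per target step, while a poor draft yields one token and wastes the draft's work.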
