ChatGPT

User sends a prompt. API gateway routes it through a tokenizer to a GPU cluster of ~10K A100s running autoregressive inference. Tokens stream back via server-sent events — the user sees words appearing one by one. The hard parts: a KV-cache that avoids O(N²) recomputation on every token, continuous batching (vLLM/Orca) that keeps GPU utilization at ~80% instead of ~30%, tensor + pipeline parallelism to shard 70B+ parameter models across multiple GPUs, a safety pipeline (pre-classifier + post-classifier + RLHF alignment), and per-token billing that meters usage accurately at 1B tokens/day.

Core: KV-Cache + Continuous Batching + Tensor Parallelism · ~100M users · ~1B tokens/day · ~10K A100 GPUs · Streaming SSE

Requirements

Functional

User sends a prompt (text + optional images); receives a streamed completion
Multi-turn conversation: full chat history sent as context on each request
Streaming response via server-sent events — tokens appear word-by-word
Safety pipeline: pre-inference classifier rejects harmful prompts; post-inference classifier filters toxic output
Per-token billing: meter input tokens + output tokens; charge per 1K tokens
Conversation storage: users can view, continue, and delete past chats

Non-Functional

Time-to-first-token < 1 sec — the key UX metric for perceived speed
Throughput: ~10K concurrent inference requests across the GPU cluster
GPU utilization > 70% — GPUs cost $2/hr each; idle = burning money
Support context windows up to 128K tokens per request
99.9% availability; prefer a degraded mode (shorter context, smaller model) over downtime
Safety: < 0.1% harmful output rate; multi-layer defense

Scale Estimation

Registered users: ~100M (DAU ~10M; concurrent sessions ~500K at peak)
Tokens per day: ~1B (input + output combined; ~60% input, ~40% output)
GPU cluster: ~10K A100s (80 GB HBM each; NVLink + InfiniBand interconnect)
Concurrent inferences: ~10K (requests actively generating tokens at any moment)
KV-cache per request: ~1.5 GB (for 100K-context on a 70B model; scales linearly with context)
Time-to-first-token: < 1 sec (prefill, i.e. encoding the prompt, dominates; decode is ~30 ms/token)
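The estimates above can be cross-checked with a few lines of arithmetic. This is a back-of-envelope sketch that uses only the figures from this section; the variable names are illustrative. One thing it surfaces: the average token rate implied by ~1B tokens/day is well below the decode rate implied by ~10K concurrent requests at ~30 ms/token, so the two figures appear to describe different operating points (daily average vs. peak).

```python
SECONDS_PER_DAY = 24 * 60 * 60

tokens_per_day = 1_000_000_000   # ~1B, input + output combined
output_share = 0.40              # ~40% of tokens are generated output
concurrent_requests = 10_000     # requests decoding at any moment
decode_ms_per_token = 30         # ~30 ms per generated token
gpus = 10_000
gpu_cost_per_hour_usd = 2.0

# Average rates implied by the daily token volume.
avg_tokens_per_sec = tokens_per_day / SECONDS_PER_DAY
avg_output_tokens_per_sec = avg_tokens_per_sec * output_share

# Decode rate implied by the concurrency figure: each request emits
# one token every decode_ms_per_token milliseconds.
peak_decode_tokens_per_sec = concurrent_requests * 1000 / decode_ms_per_token

cluster_cost_per_hour_usd = gpus * gpu_cost_per_hour_usd

print(f"average tokens/sec:        {avg_tokens_per_sec:,.0f}")
print(f"average output tokens/sec: {avg_output_tokens_per_sec:,.0f}")
print(f"peak decode tokens/sec:    {peak_decode_tokens_per_sec:,.0f}")
print(f"cluster cost per hour:     ${cluster_cost_per_hour_usd:,.0f}")
```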

API Design

POST /api/chat/completions

Send a prompt. Body: {model, messages: [{role, content}], max_tokens, temperature, stream: true}. Returns SSE stream: data: {"choices":[{"delta":{"content":"Hello"}}]}.

GET /api/conversations/{conv_id}

Retrieve conversation history. Returns {id, messages: [{role, content, created_at}], model, total_tokens}.

DELETE /api/conversations/{conv_id}

Delete a conversation. Soft-delete; purged from storage after 30 days.

GET /api/usage

Token usage and billing. Returns {period, input_tokens, output_tokens, cost_usd}. Metered per-request via billing pipeline.
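As a rough illustration of how the usage response could be assembled from per-request token counts, here is a minimal sketch. The `Pricing` type, the function names, and the prices are hypothetical; only the response fields (`period`, `input_tokens`, `output_tokens`, `cost_usd`) and the per-1K-token charging model come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-1K-token prices in USD."""
    input_per_1k_usd: float
    output_per_1k_usd: float


def cost_usd(input_tokens, output_tokens, pricing):
    """Charge input and output tokens separately, per 1K tokens."""
    return (input_tokens / 1000) * pricing.input_per_1k_usd + (
        output_tokens / 1000
    ) * pricing.output_per_1k_usd


def usage_summary(period, requests, pricing):
    """Aggregate (input_tokens, output_tokens) pairs into a usage record."""
    input_total = sum(inp for inp, _ in requests)
    output_total = sum(out for _, out in requests)
    return {
        "period": period,
        "input_tokens": input_total,
        "output_tokens": output_total,
        "cost_usd": round(cost_usd(input_total, output_total, pricing), 6),
    }


example_pricing = Pricing(input_per_1k_usd=0.5, output_per_1k_usd=1.5)
summary = usage_summary("2024-01", [(1000, 500), (2000, 1000)], example_pricing)
print(summary)
```

A production meter would more likely keep amounts in integer micro-dollars to avoid floating-point drift; floats are used here only to keep the sketch short.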

GET /api/models

List available models with context window sizes, pricing, and capabilities.

Architecture

Four tiers: API tier (gateway, auth, rate-limit, routing), Inference tier (GPU cluster with tensor/pipeline parallelism), Safety tier (pre + post classifiers), Storage tier (conversations, billing, model weights). Streaming SSE connects API tier directly to client.

[Figure: ChatGPT inference architecture]

Request Flow — Step Through

User · sends prompt → API Gateway · auth + rate-limit → Safety Pre-filter · reject harmful → Tokenizer · encode prompt → GPU Cluster · KV-cache + cont. batch → Safety Post-filter · filter output → SSE Streamer · token-by-token


Deep Dive — KV-Cache, Continuous Batching & Parallelism

(a) KV-Cache. During autoregressive generation, each new token's attention layers need the keys and values of ALL prior tokens. Without a cache, generating token N means re-running the K and V projections for all N-1 previous tokens at every layer, so a sequence of length N costs O(N²) token computations in total. The KV-cache stores the K and V tensors for every layer and every prior token in GPU HBM. Each new token computes only its own Q/K/V, appends its K/V to the cache, and attends over the cached values. Cost: ~1.5 GB of GPU memory per 100K-context request on a 70B model in this design's estimate; the exact figure depends on layer count, number of KV heads, and cache precision. This is why long-context requests are expensive: they consume GPU memory, not just compute.

# KV-cache pseudocode for one attention head in one layer (NumPy-style)
cache_k[layer].append(new_token_k)  # cache_k[layer] shape: [seq_len, head_dim]
cache_v[layer].append(new_token_v)
attn_output = softmax(new_token_q @ cache_k[layer].T / sqrt(head_dim)) @ cache_v[layer]
# Only the new token's q, k, v are computed fresh; K/V for earlier tokens are reused
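To make the saving concrete, the following sketch runs a toy single-head attention layer twice, once rebuilding K and V for every prior token at each step and once with a cache, and counts the projection calls. Everything here (the `project` counter, the tiny random weights, the dimensions) is illustrative and not taken from any real model; the point is only that both strategies produce the same outputs while the projection count drops from quadratic to linear in sequence length.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(weights, x):
    """Matrix-vector product standing in for a Q, K, or V projection."""
    project.calls += 1
    return [dot(row, x) for row in weights]


project.calls = 0  # counts projections so the two strategies can be compared


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Attention output for one query over all keys/values seen so far."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]


def generate_without_cache(xs, wq, wk, wv):
    outputs = []
    for t in range(len(xs)):
        keys = [project(wk, x) for x in xs[: t + 1]]    # rebuilt at every step
        values = [project(wv, x) for x in xs[: t + 1]]  # rebuilt at every step
        outputs.append(attend(project(wq, xs[t]), keys, values))
    return outputs


def generate_with_cache(xs, wq, wk, wv):
    keys, values, outputs = [], [], []  # keys/values play the role of the KV-cache
    for x in xs:
        keys.append(project(wk, x))      # only the new token's K
        values.append(project(wv, x))    # only the new token's V
        outputs.append(attend(project(wq, x), keys, values))
    return outputs


def count_projections(generate, xs, wq, wk, wv):
    project.calls = 0
    outputs = generate(xs, wq, wk, wv)
    return outputs, project.calls


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
DIM, SEQ_LEN = 4, 8
xs = random_matrix(SEQ_LEN, DIM)
wq, wk, wv = (random_matrix(DIM, DIM) for _ in range(3))

slow, slow_calls = count_projections(generate_without_cache, xs, wq, wk, wv)
fast, fast_calls = count_projections(generate_with_cache, xs, wq, wk, wv)
print(slow_calls, fast_calls)  # N*N + 2*N versus 3*N projections for N = 8
```

Note that the attention step itself still reads every cached key and value for each new token; the cache removes the repeated projection work, not the growing read over the cache.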

(b) Continuous Batching (Orca/vLLM). Static batching: group N requests, wait for ALL to finish before accepting new ones. Problem: a 10-token request finishes in 300 ms; a 2000-token request takes 60 seconds. The short request's slot sits idle for the remaining 59.7 seconds. Continuous batching: new requests join the batch mid-flight, and finished requests leave immediately. GPU utilization jumps from ~30% (static) to ~80% (continuous). vLLM's PagedAttention further optimizes by managing KV-cache memory like OS virtual memory pages, which nearly eliminates fragmentation and wasted HBM.

PagedAttention detail: a traditional KV-cache allocates one contiguous block of GPU memory per request, sized for the maximum sequence length. A request admitted with a 128K-token limit reserves ~2 GB even if its sequence only ever reaches 50 tokens. Wasted memory = fewer concurrent requests. PagedAttention splits the KV-cache into fixed-size pages (e.g., 16 tokens each). Pages are allocated on demand and can be non-contiguous, exactly like virtual memory. A 50-token sequence occupies only 4 pages (64 token slots), a tiny fraction of the 2 GB reservation. This alone can increase throughput by ~2-4x on long-context workloads.
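The on-demand page allocation described above can be sketched as a small bookkeeping class. This is a simplified illustration, not vLLM's actual implementation: the class and method names are hypothetical, and it tracks only page ownership (a per-request block table), not the tensors themselves.

```python
class PagedKVAllocator:
    """Tracks which fixed-size KV pages belong to which request."""

    def __init__(self, total_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(total_pages))
        self.block_tables = {}  # request_id -> list of page ids (may be non-contiguous)
        self.lengths = {}       # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, taking a new page only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.page_size:  # all current pages are full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return all of a finished request's pages to the free pool."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(total_pages=8, page_size=16)
for _ in range(50):
    allocator.append_token("req-a")
pages_used = len(allocator.block_tables["req-a"])
print(pages_used, "pages used;", len(allocator.free_pages), "pages free")
```

The last page is only partly filled (50 tokens in 64 slots), which is the small internal waste that paging accepts in exchange for never reserving the full maximum up front.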

# Continuous batching pseudocode (simplified)
while True:
    # Remove finished requests; iterate over a copy because the batch is mutated
    for req in list(active_batch):
        if req.last_token == EOS or req.num_tokens >= req.max_tokens:
            active_batch.remove(req)
            send_done(req)
            free_kv_pages(req)

    # Fill empty slots with waiting requests
    while len(active_batch) < max_batch_size and not queue.empty():
        new_req = queue.pop()
        allocate_kv_pages(new_req)
        prefill(new_req)               # encode prompt, populate its KV-cache
        active_batch.add(new_req)

    # One decode step for all active requests in parallel
    next_tokens = decode_step(active_batch)  # one batched forward pass
    for req, token in zip(active_batch, next_tokens):
        req.last_token = token         # record progress so the finish check works
        req.num_tokens += 1
        stream_token(req, token)
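The effect of the two scheduling policies on slot utilization can be seen in a toy simulation. This is a deliberately simplified model, assuming every active request produces exactly one token per decode step and ignoring prefill cost and memory limits; the function names and the workload are made up for illustration.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes.

    Returns (total decode steps, fraction of slot-steps doing useful work).
    """
    total_steps = 0
    for start in range(0, len(lengths), batch_size):
        total_steps += max(lengths[start:start + batch_size])
    return total_steps, sum(lengths) / (total_steps * batch_size)


def continuous_batching(lengths, batch_size):
    """Finished requests leave after any step; waiting requests join immediately."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    total_steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        total_steps += 1
        # Every active request emits one token; those on their last token leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return total_steps, sum(lengths) / (total_steps * batch_size)


# Hypothetical workload: one long request for every three short ones.
workload = [2000, 10, 10, 10] * 25
static_steps, static_util = static_batching(workload, batch_size=4)
cont_steps, cont_util = continuous_batching(workload, batch_size=4)
print(f"static:     {static_steps} steps, {static_util:.0%} slot utilization")
print(f"continuous: {cont_steps} steps, {cont_util:.0%} slot utilization")
```

The exact percentages depend entirely on the assumed workload mix, so they should not be read as a reproduction of the ~30% and ~80% figures quoted above.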

(c) Tensor + Pipeline Parallelism. A 70B-parameter model doesn't fit on one A100: the weights alone are ~140 GB in FP16, against 80 GB of HBM. Tensor parallelism (TP): split each layer across N GPUs. Each GPU computes 1/N of each layer, then an all-reduce syncs the results. Low latency but high bandwidth requirement (NVLink). Pipeline parallelism (PP): split layers sequentially across GPUs. GPU 1 handles layers 1-20, GPU 2 handles 21-40, etc. Lower bandwidth needs but introduces pipeline bubbles. In practice: use TP=8 within a node (NVLink), and add PP across nodes (InfiniBand) when a model outgrows one node, e.g. TP=8 with PP=4 for a 32-GPU group.

Concrete math for a 70B model: 70 billion params × 2 bytes (FP16) = 140 GB just for weights. One A100 has 80 GB HBM; reserving ~10 GB for KV-cache and activations leaves ~70 GB usable for weights. So you need at least TP=2 just to load the model, with no headroom to spare. In practice, TP=8 across one DGX node (8 A100s connected via NVLink at 600 GB/s) gives each GPU ~17.5 GB of weights, leaving plenty of room for KV-cache. For even larger models (405B, ~810 GB in FP16), combine PP=4 across 4 nodes with TP=8 within each node = 32 GPUs total. The all-reduce in TP adds ~0.5 ms per layer; pipeline bubbles in PP waste ~15-20% of compute but are amortized across large batches.
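The sizing arithmetic above is easy to reproduce for other model sizes. The helper names below are illustrative, and the ~70 GB usable-per-GPU default is the assumption stated in the text, not a hardware constant.

```python
import math


def weights_gb(params_billion, bytes_per_param=2):
    """Weight footprint in GB: billions of params times bytes per param (FP16 = 2)."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion, usable_gb_per_gpu=70.0, bytes_per_param=2):
    """Smallest GPU count whose combined usable memory holds the weights."""
    return math.ceil(weights_gb(params_billion, bytes_per_param) / usable_gb_per_gpu)


def weights_per_gpu_gb(params_billion, tp, pp=1, bytes_per_param=2):
    """Weights held by each GPU when sharded evenly over tp * pp devices."""
    return weights_gb(params_billion, bytes_per_param) / (tp * pp)


for params, tp, pp in [(70, 8, 1), (405, 8, 4)]:
    print(
        f"{params}B: {weights_gb(params):.0f} GB of weights, "
        f"at least {min_gpus_for_weights(params)} GPUs to load, "
        f"{weights_per_gpu_gb(params, tp, pp):.1f} GB per GPU at TP={tp}, PP={pp}"
    )
```

By this weights-only measure a 405B model needs at least 12 such GPUs, so the 32-GPU layout leaves substantial room for KV-cache and activations.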

(d) Safety Pipeline. Three layers: (1) Pre-inference classifier — a lightweight model (~1B params) screens the prompt. Obviously harmful prompts rejected before reaching the expensive 70B model. Latency: ~5 ms. Catches ~80% of harmful prompts at this stage. (2) Post-inference classifier — scans generated output token-by-token for harmful content; can halt generation mid-stream. Runs asynchronously alongside generation so it doesn't add latency. (3) RLHF alignment — the model itself is trained via reinforcement learning from human feedback to refuse harmful requests politely. This is the deepest defense: the model's weights encode the policy, so even novel attack patterns that bypass classifiers are often caught.

Why three layers and not just one? Each layer catches different threats. The pre-filter is fast but shallow — it catches "how to build a bomb" but misses subtle jailbreaks. The post-filter has more context (it sees the actual output) and catches harmful content the model generated despite RLHF. RLHF catches the long tail — novel prompts that neither classifier was trained on. Defense in depth: if any one layer fails, the others provide coverage. Regular red-teaming (adversarial testing) feeds new attack vectors back into classifier training data.
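The pre-filter and mid-stream post-filter can be wired together roughly as follows. This is a minimal sketch: the keyword sets are hypothetical stand-ins for trained classifiers, the names are invented for illustration, and the RLHF layer is not shown because it lives in the model weights.

```python
BLOCKED_PROMPT_TERMS = {"build a bomb"}       # hypothetical stand-in for the pre-classifier
BLOCKED_OUTPUT_TERMS = {"forbidden phrase"}   # hypothetical stand-in for the post-classifier
WITHHELD_MARKER = "[output withheld]"


class PromptRejected(Exception):
    """Raised when the pre-inference check refuses a prompt."""


def pre_filter(prompt):
    """Cheap check that runs before any GPU work is scheduled."""
    if any(term in prompt.lower() for term in BLOCKED_PROMPT_TERMS):
        raise PromptRejected("prompt rejected by pre-inference classifier")


def post_filter(tokens):
    """Pass tokens through until the accumulated output trips the check, then halt."""
    emitted = ""
    for token in tokens:
        candidate = emitted + token
        if any(term in candidate.lower() for term in BLOCKED_OUTPUT_TERMS):
            yield WITHHELD_MARKER
            return  # stop consuming the model's stream
        emitted = candidate
        yield token


def safe_generate(prompt, model):
    """Run the pre-filter eagerly, then wrap the model's token stream."""
    pre_filter(prompt)
    return post_filter(model(prompt))


def fake_model(prompt):
    """Toy token source used only to exercise the pipeline."""
    return iter(["This ", "forbidden ", "phrase ", "leaks"])


streamed = list(safe_generate("hello", fake_model))
print(streamed)
```

The example also shows a limitation of mid-stream filtering: tokens emitted before the check trips have already reached the client, which is one reason the other layers still matter.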

(e) Streaming. Server-sent events (SSE) stream each token as it's generated. The client receives data: {"delta": "Hello"} events ~30 ms apart. Time-to-first-token (TTFT) — the delay before the first token appears — is the critical UX metric. TTFT is dominated by the "prefill" phase: encoding the entire prompt through all layers. For a 10K-token prompt on a 70B model, prefill takes ~500 ms. After that, each subsequent token takes ~30 ms (decode phase).

SSE implementation: the API gateway holds a long-lived HTTP connection. The inference engine writes tokens to a ring buffer; a streamer task (a goroutine, in a Go gateway) reads from the buffer and flushes SSE frames to the client. Connection drops are handled gracefully: the client reconnects with a Last-Event-ID header and the server replays missed tokens from the buffer. For the client, rendering tokens progressively as they arrive makes the response feel fast even when total generation takes 30+ seconds, because the user starts reading almost immediately. Users typically perceive streaming responses as several times faster than waiting for the complete response.

// SSE streaming format
HTTP/1.1 200 OK
Content-Type: text/event-stream

id: 1
data: {"choices":[{"delta":{"content":"Hello"}}]}

id: 2
data: {"choices":[{"delta":{"content":" world"}}]}

id: 3
data: [DONE]
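A bounded buffer with replay, as described for reconnects, might look like the sketch below. The `TokenStream` class and its methods are hypothetical; the frame layout (`id:` and `data:` lines followed by a blank line) and the delta payload follow the format shown above.

```python
import json
from collections import deque


def format_frame(event_id, payload):
    """One SSE event: id line, data line, blank line terminator."""
    return f"id: {event_id}\ndata: {payload}\n\n"


class TokenStream:
    """Bounded buffer of SSE events for one response, replayable after a reconnect."""

    def __init__(self, capacity=1024):
        self.events = deque(maxlen=capacity)  # oldest events drop off when full
        self.next_id = 1

    def _push(self, payload):
        event_id = self.next_id
        self.next_id += 1
        self.events.append((event_id, payload))
        return format_frame(event_id, payload)

    def push_token(self, text):
        payload = json.dumps(
            {"choices": [{"delta": {"content": text}}]}, separators=(",", ":")
        )
        return self._push(payload)

    def push_done(self):
        return self._push("[DONE]")

    def replay_after(self, last_event_id):
        """Frames the client missed, given the Last-Event-ID it reported."""
        return [format_frame(i, p) for i, p in self.events if i > last_event_id]


stream = TokenStream()
first_frame = stream.push_token("Hello")
stream.push_token(" world")
stream.push_done()
missed = stream.replay_after(1)  # client saw event 1, then dropped
print("".join(missed), end="")
```

If the client has fallen further behind than the buffer's capacity, the missing events are gone; a real server would need to detect that case and fail the stream or regenerate.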

Inference Request Lifecycle (Mermaid)

sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant SF as Safety Pre-filter
    participant TK as Tokenizer
    participant GPU as GPU Cluster
    participant PF as Safety Post-filter
    participant SSE as SSE Streamer
    U->>GW: POST /chat/completions (stream:true)
    GW->>SF: classify prompt
    SF-->>GW: safe
    GW->>TK: tokenize prompt
    TK->>GPU: prefill (encode all tokens)
    GPU-->>SSE: token 1 (TTFT ~500ms)
    GPU-->>PF: token 1 safety check
    PF-->>SSE: pass
    SSE-->>U: data: token 1
    GPU-->>SSE: token 2 (~30ms later)
    SSE-->>U: data: token 2
    Note over GPU,SSE: continues until EOS or max_tokens
    SSE-->>U: data: [DONE]

Interview answer

"User sends a prompt to the API gateway. After safety pre-screening, it's tokenized and sent to a GPU cluster running the 70B model with tensor parallelism (8-way within node) and pipeline parallelism (across nodes). KV-cache avoids O(N²) recomputation: each new token only computes its own attention against cached K/V from prior tokens. Continuous batching (vLLM-style) keeps GPU utilization at ~80% by letting requests join/leave the batch dynamically. Tokens stream back via SSE; TTFT under 1 second. Post-inference safety classifier filters output mid-stream. Billing meters input + output tokens via Kafka."

Anti-patterns

🚫

Static batching — wait for the longest request in the batch to finish before accepting new ones

A 10-token request finishes in 300 ms but the GPU slot sits idle for 59.7 seconds waiting for the 2000-token request in the same batch. GPU utilization drops to ~30%. You're paying $20K/hr for a 10K-GPU cluster at 30% utilization.

Better: Continuous batching. Finished requests leave immediately; new requests join mid-batch. GPU utilization ~80%.

🚫

No KV-cache — recompute all attention from scratch for every new token

Generating a 1000-token response requires 1000 forward passes. Without a cache, pass N re-encodes all N-1 prior tokens to rebuild their K/V. Total: O(N²) token computations. A response that takes 30 seconds with cache takes 50+ minutes without it.

Better: KV-cache stores prior K/V tensors in GPU HBM. Each new token computes only its own K/V and attends over the cache. O(N) token computations in total.

🚫

Safety filter only after full generation — harmful content already generated and cached

The model generates a complete harmful response, stores it in the conversation cache, then the filter catches it. The harmful content existed in memory and may have been logged. Wasted GPU compute on content that gets thrown away.

Better: Pre-inference classifier rejects harmful prompts before touching the expensive GPU. Post-inference filter runs token-by-token mid-stream and can halt early.

Tradeoffs & Design Choices

KV-cache memory vs compute. Caching K/V tensors uses ~1.5 GB per 100K-context request. On an 80 GB A100, this limits concurrent requests per GPU. But the alternative (recompute) is 100x slower. The tradeoff: serve fewer concurrent long-context requests, or more concurrent short-context requests.
Tensor parallelism vs pipeline parallelism. TP: lower latency (all GPUs work on each token), but requires high-bandwidth NVLink (~600 GB/s per GPU). PP: works over slower InfiniBand (~200 Gb/s, i.e. ~25 GB/s, per link), but pipeline bubbles waste ~15-20% of compute. Combine both: TP within a node, PP across nodes.
Streaming vs batch response. Streaming (SSE) reduces perceived latency dramatically — user starts reading immediately. But it complicates safety filtering (must check mid-stream) and billing (tokens counted incrementally). Non-streaming is simpler but feels 10x slower to the user.
Model size vs latency. Larger models (70B+) produce higher-quality output but have higher TTFT and lower throughput. Smaller models (7B) are 10x faster but less capable. Solution: route simple queries to small models and complex queries to large models, which also cuts cost.
Safety strictness vs utility. Overly aggressive pre-filters reject legitimate queries (false positives). Under-filtering allows harmful content. The balance: a high-precision pre-filter (reject only the obviously harmful), a nuanced post-filter (context-aware), and RLHF alignment (model learns appropriate refusals).
Prefix caching vs fresh computation. Many requests share the same system prompt (e.g., "You are a helpful assistant..."). Caching the KV-cache for common prefixes saves ~200 ms of prefill per request. But it consumes persistent GPU memory and requires eviction logic. Worth it for system prompts used by millions of requests/day.
Speculative decoding tradeoff. A small draft model proposes N tokens; the large model verifies in one forward pass. If the draft is accurate, you get N tokens for the cost of 1 large-model step — 2-3x throughput gain. But if the draft is bad (complex reasoning), rejection rate is high and you've wasted the draft model's compute. Best for predictable text (code completion, boilerplate).
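For the speculative decoding item above, the sensitivity to draft quality can be estimated with a standard idealization: if each drafted token is accepted independently with probability a, a draft of length k yields 1 + a + a² + ... + a^k tokens per large-model pass on average. The sketch below applies that formula; the independence assumption, the function names, and the relative draft cost of 0.05 are all assumptions made for illustration.

```python
def expected_tokens_per_verify(acceptance_rate, draft_len):
    """Expected tokens produced per large-model pass.

    Accepted drafts form a prefix; the large model always contributes one
    more token (a correction or a bonus token), hence the leading 1.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    return sum(acceptance_rate ** i for i in range(draft_len + 1))


def expected_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding, charging each draft step a fraction
    (draft_cost_ratio) of one large-model step."""
    cost_per_round = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_verify(acceptance_rate, draft_len) / cost_per_round


for rate in (0.3, 0.8):
    speedup = expected_speedup(rate, draft_len=4, draft_cost_ratio=0.05)
    print(f"acceptance {rate:.0%}: ~{speedup:.2f}x")
```

With these assumed numbers the gain is large when most drafts are accepted and close to nothing when few are, which matches the tradeoff described in the list.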

Failure Modes

💥

GPU node failure mid-inference

One of 8 GPUs in a tensor-parallel group dies. The entire request fails because all 8 GPUs must synchronize.

→ Mitigation: request-level retry on a different GPU group. Checkpoint prefill state so retry skips prompt encoding. Spare GPU pools for fast failover.

🔥

KV-cache OOM — too many long-context requests

50 concurrent 128K-context requests land on one GPU; each needs ~2 GB of KV-cache. 100 GB total exceeds 80 GB HBM, even before counting model weights. OOM crash.

→ Mitigation: admission control per GPU based on available HBM. Queue long-context requests. vLLM's PagedAttention swaps cold KV pages to CPU memory.
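Admission control of this kind comes down to reserving KV-cache bytes before a request is scheduled. The sketch below is one possible shape for it; the class name, the 60 GiB KV budget, and the 16 KiB-per-token figure (chosen to match the ~2 GB per 128K-context request used in this scenario) are assumptions. The `kv_bytes_per_token` helper shows how the per-token figure would be derived for a real model; for an assumed configuration of 80 layers, 8 KV heads, head dimension 128, and FP16, it gives 320 KiB per token, or about 40 GiB for a 128K context, so real budgets can be far tighter than the scenario's numbers suggest.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored per token across all layers (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


@dataclass
class KvBudget:
    capacity_bytes: int      # HBM set aside for KV-cache on one GPU
    bytes_per_token: int
    reserved_bytes: int = 0

    def try_admit(self, context_tokens):
        """Reserve KV space for a request, or refuse so the caller can queue it."""
        needed = context_tokens * self.bytes_per_token
        if self.reserved_bytes + needed > self.capacity_bytes:
            return False
        self.reserved_bytes += needed
        return True

    def release(self, context_tokens):
        """Give back a finished request's reservation."""
        self.reserved_bytes -= context_tokens * self.bytes_per_token


budget = KvBudget(capacity_bytes=60 * GIB, bytes_per_token=16 * 1024)
admitted = sum(budget.try_admit(128 * 1024) for _ in range(50))
print(admitted, "admitted;", 50 - admitted, "queued")
```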

🔄

Prompt injection bypasses safety filter

Adversarial prompt tricks the pre-classifier into rating it safe. Model generates harmful content.

→ Mitigation: defense in depth — pre-filter + post-filter + RLHF. Post-filter catches what pre-filter misses. Regular red-teaming to update classifiers.

⏰

Thundering herd after outage recovery

GPU cluster comes back online after 10-min outage. 500K queued requests hit simultaneously. Cluster overloads again.

→ Mitigation: rate-limited drain of queued requests. Priority tiers (paid users first). Exponential backoff on client retries.
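Both halves of this mitigation are small pieces of logic. The sketch below shows client-side exponential backoff with full jitter and a server-side drain that releases queued requests at a fixed rate, highest-priority tier first. Function names, the base delay, the cap, and the tier numbering (lower number = higher priority) are illustrative choices.

```python
import random


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)] seconds.

    Randomizing the delay spreads retries out so clients do not return in lockstep.
    """
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def drain_schedule(queued_by_tier, per_second):
    """Split queued requests into one batch per second, best tier first."""
    ordered = [
        request
        for tier in sorted(queued_by_tier)
        for request in queued_by_tier[tier]
    ]
    return [ordered[i:i + per_second] for i in range(0, len(ordered), per_second)]


queued = {
    1: ["free-1", "free-2", "free-3"],  # tier 1: free users
    0: ["paid-1", "paid-2"],            # tier 0: paid users, drained first
}
schedule = drain_schedule(queued, per_second=2)
for second, batch in enumerate(schedule):
    print(f"t+{second}s: {batch}")

seeded = random.Random(0)
delays = [backoff_delay(attempt, rng=seeded) for attempt in range(10)]
```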

💸

Billing pipeline lag — tokens generated but not metered

Kafka consumer falls behind. Users generate tokens that aren't billed. Revenue leakage.

→ Mitigation: synchronous token count in the response path (best-effort billing inline). Kafka for authoritative reconciliation. Alert on consumer lag > 5 min.

Interview Tips

Lead with KV-cache. "Without it, generation is O(N²). With it, O(N). This is the single most important optimization." Shows you understand transformer internals.
Name continuous batching. "vLLM-style continuous batching keeps GPU utilization at 80% vs 30% for static batching." This is the insight that separates candidates who've worked with inference systems from those who haven't.
Explain the parallelism strategy. "Tensor parallelism within a node (NVLink), pipeline parallelism across nodes (InfiniBand). Combine both for 70B+ models." Concrete and correct.
TTFT is the UX metric. "Time-to-first-token matters more than tokens-per-second for user perception. Prefill dominates TTFT." Shows product thinking.
Safety is multi-layer. "Pre-filter rejects before GPU, post-filter catches mid-stream, RLHF aligns the model itself." Don't just say "add a filter" — describe the pipeline.

Evolution

MVP — Single GPU, one request at a time

7B model on one A100. No batching. No KV-cache optimization. Handles ~1 request/sec. Good enough for a demo or internal prototype. Total cost: $2/hr for one GPU.

Static batching + KV-cache

Batch 8 requests together. KV-cache avoids O(N²) recomputation: each new token reuses cached K/V. Throughput ~10 req/sec. GPU utilization only ~30% because short requests waste slots waiting for long ones.

Continuous batching + tensor parallelism

vLLM-style continuous batching with PagedAttention. 70B model across 8 GPUs with tensor parallelism (NVLink). GPU utilization jumps to ~80%. Throughput ~100 req/sec per node. First production-grade setup.

Multi-node cluster + pipeline parallelism + safety + streaming

10K GPUs across hundreds of nodes. TP=8 within each DGX node, PP across nodes via InfiniBand. SSE streaming for low perceived latency. Three-layer safety pipeline (pre-filter, post-filter, RLHF). Per-token billing via Kafka. ~10K concurrent inference requests.

Model routing + speculative decoding + multimodal

Route simple queries ("what's 2+2?") to a fast 7B model; complex queries to the full 70B. Speculative decoding: a small draft model proposes N tokens and the large model verifies them in one forward pass, for a 2-3x throughput gain when the draft is usually right. Vision encoder for image inputs, audio encoder for voice. Prefix caching for system prompts shared across requests.
