Model Serving — Online vs Batch — Concept

01

Why this matters

You trained a recommendation model. Now what — does it predict in real time when the user opens the app, or do you precompute every user's recs nightly and read from a cache? Same model, completely different infrastructure. Get the wrong answer and you either burn 10× the compute or fail to personalize at all.

Online serving is sub-100ms inference per request. Batch scoring is hourly or daily prediction over millions of users at once. Streaming scoring is the middle ground: predictions triggered by events and available within seconds. Each fits a different product surface.

02

The three serving modes

Mode	Latency	When predicted	Best for
Batch scoring	Hours to a day to refresh; instant lookup	Nightly job, results stored	Email digests, daily-recs widgets, "trending" lists
Streaming scoring	Seconds	Triggered by events (purchase, click)	Fraud detection, post-action triggers
Online (synchronous)	10-100ms p99	Inline with user request	Search ranking, autocomplete, real-time recs
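To make the three rows concrete, here is a minimal sketch in which one scoring function is called three different ways. Everything in it is a hypothetical stand-in: `score` plays the trained model, a plain dict plays the results store, and the event and request shapes are invented for illustration.

```python
from typing import Dict

# Hypothetical stand-in for a trained model: feature dict in, score out.
def score(features: Dict[str, float]) -> float:
    return 0.7 * features.get("recent_clicks", 0.0) + 0.3 * features.get("purchases", 0.0)

# Batch: score every user ahead of time and store the results for instant lookup.
def run_batch(all_users: Dict[str, Dict[str, float]], kv_store: Dict[str, float]) -> None:
    for user_id, features in all_users.items():
        kv_store[user_id] = score(features)

# Streaming: re-score one user when an event (purchase, click) arrives.
def on_event(event: dict, kv_store: Dict[str, float]) -> None:
    kv_store[event["user_id"]] = score(event["features"])

# Online: score inline with the request, using whatever context it carries.
def handle_request(request_features: Dict[str, float]) -> float:
    return score(request_features)
```

The model code is identical in all three paths. What changes is who triggers the call and where the result lives, which is exactly what drives the infrastructure difference.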

03

Online serving — the architecture

A request arrives. To serve a prediction, the system must:

Fetch features from the feature store: sub-5ms for ~100 features.
Encode and transform: categorical lookups, normalization, embedding fetches.
Run inference: model forward pass, ~5-50ms depending on model size.
Post-process: top-K filtering, business-rule overrides, deduping.
Return the response: the end-to-end budget is often 100-200ms.
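The steps above can be sketched as a small pipeline that times each stage. This is a toy under stated assumptions, not a real server: `FEATURE_STORE` is an in-memory dict, `infer` is a made-up scoring rule rather than a model forward pass, and all function names are hypothetical.

```python
import time
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical in-memory feature store; a real one is a low-latency KV service.
FEATURE_STORE: Dict[str, Dict[str, float]] = {
    "user_42": {"age_bucket": 3.0, "recent_clicks": 12.0},
}

def fetch_features(user_id: str) -> Dict[str, float]:
    return FEATURE_STORE.get(user_id, {})

def transform(features: Dict[str, float]) -> List[float]:
    # Toy normalization; real systems also do categorical lookups and embedding fetches.
    return [features.get("age_bucket", 0.0) / 10.0,
            features.get("recent_clicks", 0.0) / 100.0]

def infer(vector: List[float], candidates: List[str]) -> List[Tuple[str, float]]:
    # Made-up scoring rule standing in for the model forward pass.
    base = sum(vector)
    return [(item, base + 0.01 * len(item)) for item in candidates]

def post_process(scored: List[Tuple[str, float]], k: int,
                 blocked: FrozenSet[str]) -> List[str]:
    # Business-rule override (blocklist), dedupe, then top-K.
    seen, out = set(), []
    for item, _ in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if item in blocked or item in seen:
            continue
        seen.add(item)
        out.append(item)
        if len(out) == k:
            break
    return out

def serve(user_id: str, candidates: List[str], k: int = 2,
          blocked: FrozenSet[str] = frozenset()) -> Tuple[List[str], Dict[str, float]]:
    timings_ms: Dict[str, float] = {}

    def timed(stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0
        return result

    features = timed("fetch", fetch_features, user_id)
    vector = timed("transform", transform, features)
    scored = timed("infer", infer, vector, candidates)
    top = timed("post_process", post_process, scored, k, blocked)
    return top, timings_ms
```

Recording a per-stage timing is the useful habit here: when the total blows the budget, you want to know whether the feature fetch or the forward pass is to blame.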

The whole thing scales with traffic. 10k QPS = 10k inferences/sec. At that rate, every millisecond per inference matters, which is why high-performance model servers (TensorFlow Serving, NVIDIA Triton, TorchServe) exist.
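A rough way to see why each millisecond matters is Little's law: average requests in flight = arrival rate × time in the system. The sketch below applies it to the 10k QPS figure. The per-replica concurrency of 8 is an assumed number for illustration, not a property of any particular server.

```python
import math

def in_flight_requests(qps: float, latency_ms: float) -> float:
    # Little's law: concurrency = arrival rate (per second) x latency (seconds).
    return qps * latency_ms / 1000.0

def replicas_needed(qps: float, latency_ms: float, concurrency_per_replica: int) -> int:
    # Assumes each replica handles a fixed number of concurrent inferences.
    return math.ceil(in_flight_requests(qps, latency_ms) / concurrency_per_replica)
```

At 10k QPS and 30 ms per inference there are 300 requests in flight, which needs 38 replicas at 8 concurrent inferences each. Add a single millisecond (31 ms) and that becomes 310 in flight and 39 replicas.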

p99 inference budget: < 50 ms

Small model on CPU: ~10 ms

Deep model on GPU: ~30 ms

Feature fetch (typical): ~5 ms

04

When batch beats online

Online sounds better — fresher predictions. It's also 10-100× more expensive per prediction. Use batch when:

Predictions are stable over hours. "Top 100 movies for this user" probably doesn't change in an hour. Compute nightly, serve from a KV store, save 99% of inference cost.
The product surface tolerates staleness. Email digest, "for you" widget on the homepage. Fine if it was computed 8 hours ago.
The candidate set is small and known ahead of time. Score every user × every item once; serve top-K from cache.
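As a minimal sketch of the "score every user × every item once" idea, the function below does a full cross product and keeps the top-K per user. The dot-product scorer and the vector inputs are assumptions for illustration; a real nightly job would run the actual model on a cluster.

```python
import heapq
from typing import Dict, List

def batch_score_top_k(user_vectors: Dict[str, List[float]],
                      item_vectors: Dict[str, List[float]],
                      k: int) -> Dict[str, List[str]]:
    """Score every user against every item once; keep the top-k item ids per user."""
    results: Dict[str, List[str]] = {}
    for user_id, u in user_vectors.items():
        # Dot product stands in for the model's score (an assumption).
        scored = (
            (sum(a * b for a, b in zip(u, v)), item_id)
            for item_id, v in item_vectors.items()
        )
        # nlargest returns the k best (score, item_id) pairs, best first.
        results[user_id] = [item_id for _, item_id in heapq.nlargest(k, scored)]
    return results
```

The returned dict is what you would write to the cache. At request time the "inference" is a single key lookup such as `results["u1"]`, with no model call.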

Use online when:

Context matters per request (current session, query, location).
The candidate set is huge and changes (search ranking over millions of products).
Latency-to-prediction matters (autocomplete suggesting words as you type).
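The two checklists can be folded into a small rule of thumb. This is only a sketch: the `Surface` fields and the one-hour staleness threshold are assumptions chosen to mirror the bullets above, not a standard decision procedure.

```python
from dataclasses import dataclass

@dataclass
class Surface:
    needs_request_context: bool    # current session, query, location
    candidates_known_ahead: bool   # small candidate set, known before the request
    tolerated_staleness_s: float   # how old a served prediction may be

def choose_mode(surface: Surface) -> str:
    # Per-request context or an open-ended candidate set forces online scoring.
    if surface.needs_request_context or not surface.candidates_known_ahead:
        return "online"
    # Assumed threshold: predictions that can be an hour old fit a batch job.
    if surface.tolerated_staleness_s >= 3600:
        return "batch"
    # Fresher than batch allows, but not tied to a request: event-triggered scoring.
    return "streaming"
```

For example, an email digest (`Surface(False, True, 8 * 3600)`) lands on batch, while search ranking (`Surface(True, False, 0.0)`) lands on online.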

05

Deep dive — the candidate generation + ranking pattern

Modern recommendation and search systems split serving into two stages and use a different mode for each:

Stage 1: Candidate generation. From millions of items, narrow to ~hundreds. Often precomputed in batch: a daily job runs ANN search over embeddings to build a "top 200 candidate items per user" list, which is then cached.

Stage 2 — Ranking. The 200 candidates are scored by a richer model online, with full real-time context. Output: ordered top 10 to show.

This split is everywhere — YouTube recommendation, Twitter feed ranking, Amazon search. Stage 1 takes care of "find relevant stuff" cheaply (batch). Stage 2 takes care of "rank what's relevant given right now" expensively (online).

Net cost: the full ML model runs on hundreds of items per request, not millions. Latency stays within the online serving budget. Personalization stays high because stage 2 is fully online with current context.
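Here is a minimal sketch of the two stages. Note the substitutions: exact brute-force search stands in for ANN, and the "richer model" in stage 2 is just an embedding score plus a hypothetical per-request `context_boost`. The `cost_reduction` helper assumes ranking cost scales linearly with the number of items scored and ignores the cost of stage 1.

```python
from typing import Dict, List

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def build_candidates(user_emb: Dict[str, List[float]],
                     item_emb: Dict[str, List[float]],
                     n: int) -> Dict[str, List[str]]:
    # Stage 1 (batch): exact nearest-neighbor search, standing in for ANN.
    cache: Dict[str, List[str]] = {}
    for user_id, u in user_emb.items():
        ranked = sorted(item_emb, key=lambda iid: dot(u, item_emb[iid]), reverse=True)
        cache[user_id] = ranked[:n]
    return cache

def rank_online(candidates: List[str],
                item_emb: Dict[str, List[float]],
                user_vec: List[float],
                context_boost: Dict[str, float],
                k: int) -> List[str]:
    # Stage 2 (online): score only the cached candidates, with request-time context.
    def score(item_id: str) -> float:
        return dot(user_vec, item_emb[item_id]) + context_boost.get(item_id, 0.0)
    return sorted(candidates, key=score, reverse=True)[:k]

def cost_reduction(catalog_size: int, num_candidates: int) -> float:
    # Assumes online ranking cost is proportional to the number of items scored.
    return catalog_size / num_candidates
```

Under that linear assumption, ranking 200 candidates instead of a 1,000,000-item catalog means `cost_reduction(1_000_000, 200)`, a 5000× cut in items scored per request. The tradeoff is also visible in the code: an item that stage 1 drops can never be rescued by stage 2, however strong its context signal.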

Interview answer

"Two-stage serving: batch candidate generation (precomputed daily, ANN over embeddings) narrows to top-200, then online ranking model scores those with real-time features. Reduces inference cost by 1000× vs ranking the full catalog online, while keeping freshness."

06

Real-world serving infra

TensorFlow Serving / TorchServe

Open-source model servers

HTTP / gRPC interface, batching, model versioning. The default for self-hosted online serving.
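For a feel of the HTTP interface, the sketch below builds a request for TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict`, row format with an `instances` list, default REST port 8501). The host, the model name `recs`, and the input shape are hypothetical; the function only constructs the request and does not send it.

```python
import json
import urllib.request
from typing import Any, List

def build_predict_request(host: str, model_name: str, instances: List[Any],
                          port: int = 8501) -> urllib.request.Request:
    # TensorFlow Serving REST predict endpoint, row ("instances") format.
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running server you would pass the result to `urllib.request.urlopen(req, timeout=...)` with a timeout that matches your latency budget, so one slow inference cannot stall the whole request.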

NVIDIA Triton

GPU-optimized

Multi-framework (TF, PyTorch, ONNX), supports GPU + CPU. Best-in-class throughput per GPU.

SageMaker / Vertex AI / Azure ML

Managed

Cloud-managed serving. Auto-scales, blue-green deploys, A/B testing built in. Pricier, but far less operational work.

Spark batch + Redis cache

The 80% solution

Most production "ML" systems aren't online inference — they're nightly Spark jobs writing to Redis.
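The serving half of that pattern is a key lookup with a fallback. In this sketch a plain dict stands in for Redis (with a Redis client the two dict operations would become `set` and `get` calls). The `recs:<user_id>` key scheme, the JSON payload, and the staleness check are assumptions for illustration.

```python
import json
from typing import Dict, List

def write_recs(store: Dict[str, str], user_id: str, items: List[str], now: float) -> None:
    # Called by the nightly job: store the recs with the time they were computed.
    store[f"recs:{user_id}"] = json.dumps({"items": items, "computed_at": now})

def read_recs(store: Dict[str, str], user_id: str, fallback: List[str],
              max_age_s: float, now: float) -> List[str]:
    # Called at request time: no model runs here, only a lookup.
    raw = store.get(f"recs:{user_id}")
    if raw is None:
        return fallback  # new user, or the job has not covered them yet
    payload = json.loads(raw)
    if now - payload["computed_at"] > max_age_s:
        return fallback  # too stale, e.g. the nightly job failed
    return payload["items"]
```

The fallback (for example a global popular-items list) matters more than it looks: batch systems fail quietly, and a missed nightly run should degrade to generic results rather than to an error page.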

07

Used in problems

Typeahead does pure online inference (on every keystroke). Recommendation algorithm uses the two-stage pattern (batch candidates + online ranking). News feed scores the top-200 candidates online on each load.

