You trained a recommendation model. Now what — does it predict in real time when the user opens the app, or do you precompute every user's recs nightly and read from a cache? Same model, completely different infrastructure. Get the wrong answer and you either burn 10× the compute or fail to personalize at all.
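Online serving runs inference per request, typically in under 100 ms. Batch scoring runs hourly or daily prediction jobs over millions of users at once. Streaming scoring is the middle ground: predictions are recomputed as new events arrive, then served from a store. Each fits a different product surface.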
Online serving cost scales with traffic: 10k QPS means 10k inferences per second. At that rate, every millisecond per inference matters, which is why high-performance model servers (TensorFlow Serving, NVIDIA Triton, TorchServe) exist.
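To make "every millisecond matters" concrete, here is a rough capacity sketch based on Little's law (requests in flight = arrival rate × time in system). The function name, the per-replica concurrency of 8, and the 70% headroom factor are illustrative assumptions, not figures from any particular model server.

```python
import math

def replicas_needed(qps, latency_ms, concurrency_per_replica, headroom=0.7):
    """Estimate how many replicas a given load needs.

    Little's law: average requests in flight = arrival rate * time in system.
    `concurrency_per_replica` and `headroom` are hypothetical tuning knobs.
    """
    in_flight = qps * (latency_ms / 1000.0)
    usable_slots = concurrency_per_replica * headroom
    return math.ceil(in_flight / usable_slots)

# Same 10k QPS, different per-inference latency.
for latency_ms in (10, 30):
    n = replicas_needed(10_000, latency_ms, concurrency_per_replica=8)
    print(f"{latency_ms} ms per inference -> {n} replicas")
```

Under these assumptions, tripling the per-inference latency triples the fleet size, which is the whole argument for squeezing milliseconds out of the serving path.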
p99 inference budget: < 50 ms
Small model on CPU: ~10 ms
Deep model on GPU: ~30 ms
Feature fetch (typical): ~5 ms
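A quick way to read these numbers together is to subtract the feature fetch and model time from the budget and see what is left for network, queuing, and tail variance. This is only a back-of-the-envelope sketch: it adds typical figures and compares them to a p99 budget, so real headroom will be smaller. The constant and function names are made up for illustration.

```python
BUDGET_MS = 50.0         # p99 inference budget from the figures above
FEATURE_FETCH_MS = 5.0   # typical feature fetch

def remaining_ms(model_ms, feature_fetch_ms=FEATURE_FETCH_MS, budget_ms=BUDGET_MS):
    """Headroom left after feature fetch and model inference."""
    return budget_ms - (feature_fetch_ms + model_ms)

print("small model on CPU:", remaining_ms(10.0), "ms left")
print("deep model on GPU:", remaining_ms(30.0), "ms left")
```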
04
When batch beats online
Online sounds better — fresher predictions. It's also 10-100× more expensive per prediction. Use batch when:
Predictions are stable over hours. "Top 100 movies for this user" probably doesn't change in an hour. Compute nightly, serve from a key-value store, and cut inference cost by up to 99%.
The product surface tolerates staleness. Email digest, "for you" widget on the homepage. Fine if it was computed 8 hours ago.
The candidate set is small and known ahead of time. Score every user × item pair once, then serve the top-K from cache.
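The batch pattern in the list above can be sketched in a few lines. This is a minimal illustration, not a production job: a dot product stands in for the real model, a plain dict stands in for the KV store, and all names and vectors are hypothetical.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def precompute_top_k(user_vecs, item_vecs, k):
    """Nightly job: score every user x item pair once, keep only the top-k ids."""
    cache = {}
    for user_id, u in user_vecs.items():
        scored = ((dot(u, v), item_id) for item_id, v in item_vecs.items())
        cache[user_id] = [item_id for _, item_id in heapq.nlargest(k, scored)]
    return cache

def serve(cache, user_id, fallback=()):
    """Request path: a cache lookup, no model inference at all."""
    return cache.get(user_id, list(fallback))

users = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
items = {"action": [0.9, 0.1], "drama": [0.1, 0.9], "comedy": [0.5, 0.5]}

kv_store = precompute_top_k(users, items, k=2)
print(serve(kv_store, "alice"))
print(serve(kv_store, "new_user", fallback=["comedy"]))
```

Note the fallback: a user who signed up after the nightly run has no cache entry, which is exactly the staleness cost the product surface has to tolerate.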
Use online when:
Context matters per request (current session, query, location).
The candidate set is huge and changes (search ranking over millions of products).
Latency-to-prediction matters (autocomplete suggesting words as you type).
05
Deep dive — the candidate generation + ranking pattern
Modern recommendation and search systems split serving into two stages and use a different mode for each:
Stage 1: Candidate generation. From millions of items, narrow to a few hundred. Often precomputed in batch: a daily job runs ANN (approximate nearest neighbor) search over embeddings to build a "top 200 candidate items per user" list, which is then cached.
Stage 2 — Ranking. The 200 candidates are scored by a richer model online, with full real-time context. Output: ordered top 10 to show.
This split is everywhere — YouTube recommendation, Twitter feed ranking, Amazon search. Stage 1 takes care of "find relevant stuff" cheaply (batch). Stage 2 takes care of "rank what's relevant given right now" expensively (online).
Net cost: the full ranking model runs on hundreds of items per request, not millions. Latency stays within the 100 ms budget. Personalization stays high because stage 2 is fully online with current context.
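A toy version of the two stages, under loud assumptions: exact dot-product search stands in for ANN, a dot product plus a per-item context boost stands in for the richer ranking model, and every name and number here is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def generate_candidates(user_vec, item_vecs, n):
    """Stage 1 (batch): exact nearest neighbours by dot product, a stand-in for ANN."""
    ranked = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[:n]

def rank_online(candidates, item_vecs, user_vec, context_boost, k):
    """Stage 2 (online): rescore only the candidates using request-time context."""
    def score(item_id):
        return dot(user_vec, item_vecs[item_id]) + context_boost.get(item_id, 0.0)
    return sorted(candidates, key=score, reverse=True)[:k]

items = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [0.6, 0.4], "d": [0.0, 1.0]}
user = [1.0, 0.0]

cached = generate_candidates(user, items, n=3)   # precomputed, e.g. nightly
context = {"c": 0.5, "d": 5.0}                   # signals known only at request time
top = rank_online(cached, items, user, context, k=2)
print(cached, top)
```

The example also shows the trade-off: item "d" has a large context boost but never reaches the ranker, because stage 1 filtered it out before the context existed. Candidate recall caps what stage 2 can do.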
Interview answer
"Two-stage serving: batch candidate generation (precomputed daily, ANN over embeddings) narrows the catalog to the top 200, then an online ranking model scores those with real-time features. That cuts inference cost by roughly 1000× or more versus ranking the full catalog online, while keeping predictions fresh."
06
Real-world serving infra
TensorFlow Serving / TorchServe
Open-source model servers
HTTP/gRPC interface, request batching, model versioning. A common default for self-hosted online serving.
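As an illustration of the HTTP interface and model versioning, here is a sketch of a client call against TensorFlow Serving's REST predict endpoint (`/v1/models/<name>[/versions/<n>]:predict`, with a JSON body of `instances`). The host, the model name `ranker`, the version number, and the 50 ms timeout are assumptions for the example; `predict` only works against a running server.

```python
import json
import urllib.request

def build_predict_request(host, model_name, instances, version=None):
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"   # pin a specific model version
    url = f"http://{host}{path}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(request, timeout_s=0.05):
    """Send the request; the timeout mirrors a tight online latency budget."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)["predictions"]

# Hypothetical host and model name; 8501 is TensorFlow Serving's usual REST port.
req = build_predict_request("localhost:8501", "ranker", [[0.1, 0.2, 0.3]], version=3)
print(req.get_method(), req.full_url)
```

Sending several rows in `instances` in one call is the client-side half of batching; the server can also batch across requests when configured to do so.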