Concept · Machine Learning Systems

Model Serving — Online vs Batch

01

Why this matters

You trained a recommendation model. Now what — does it predict in real time when the user opens the app, or do you precompute every user's recs nightly and read from a cache? Same model, completely different infrastructure. Get the wrong answer and you either burn 10× the compute or fail to personalize at all.

Online serving is sub-100ms inference per request. Batch scoring is hourly or daily prediction over millions of users at once. Streaming scoring is the middle ground: predictions are triggered by events and land within seconds. Each fits a different product surface.

02

The three serving modes

Mode | Latency | When predicted | Best for
Batch scoring | Hours to a day to refresh; instant lookup | Nightly job, results stored | Email digests, daily-recs widgets, "trending" lists
Streaming scoring | Seconds | Triggered by events (purchase, click) | Fraud detection, post-action triggers
Online (synchronous) | 10-100ms p99 | Inline with user request | Search ranking, autocomplete, real-time recs

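To make the streaming row concrete, here is a minimal sketch of event-triggered scoring. Everything in it is an assumption for illustration: the `Event` shape, the `fraud_score` rule (a stand-in for a trained model), and the in-memory `deque` standing in for a real event stream.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    user_id: str
    kind: str      # e.g. "purchase" or "click"
    amount: float


def fraud_score(event: Event) -> float:
    """Hypothetical stand-in for a model: larger purchases score higher."""
    if event.kind != "purchase":
        return 0.0
    return min(event.amount / 1000.0, 1.0)


def consume(events: deque, threshold: float = 0.8) -> list[str]:
    """Drain the queue, scoring each event as it is read; return flagged user ids."""
    flagged = []
    while events:
        event = events.popleft()
        if fraud_score(event) >= threshold:
            flagged.append(event.user_id)
    return flagged


queue = deque([
    Event("u1", "click", 0.0),
    Event("u2", "purchase", 950.0),
    Event("u3", "purchase", 40.0),
])
flagged_users = consume(queue)
```

The point of the pattern is that scoring is driven by the event, not by a user request or a nightly schedule, so the result is ready seconds after the action.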
03

Online serving — the architecture

A request arrives. To serve a prediction, the system must:

  1. Fetch features from the feature store — sub-5ms for ~100 features.
  2. Encode + transform — categorical lookups, normalization, embedding fetches.
  3. Run inference — model forward pass, ~5-50ms depending on model size.
  4. Post-process — top-K filtering, business-rule overrides, deduping.
  5. Return — total budget often 100-200ms.
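The five steps above can be sketched as a single request handler that times each stage against a budget. This is an illustrative outline only: the feature values, the toy `infer` function, and the names `serve`, `blocked`, and `budget_ms` are all assumptions, and a real system would call a feature store and a model server instead of local functions.

```python
import time


def fetch_features(user_id):
    # Stand-in for a feature-store read; a real call would be a network lookup.
    return {"age_bucket": 3, "country": "DE", "clicks_7d": 12.0}


def transform(raw):
    # Categorical lookup plus simple normalization.
    country_index = {"US": 0, "DE": 1}.get(raw["country"], 2)
    return [raw["age_bucket"] / 10.0, float(country_index), raw["clicks_7d"] / 100.0]


def infer(vector, item_ids):
    # Stand-in for a model forward pass: one score per candidate item.
    base = sum(vector)
    return {item: base / (1 + i) for i, item in enumerate(item_ids)}


def post_process(scores, k, blocked):
    # Business-rule filter first, then top-K by score.
    allowed = {item: s for item, s in scores.items() if item not in blocked}
    return sorted(allowed, key=allowed.get, reverse=True)[:k]


def serve(user_id, item_ids, k=2, blocked=frozenset(), budget_ms=100.0):
    timings = {}

    def timed(name, fn, *args):
        # Record wall-clock time per stage in milliseconds.
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = (time.perf_counter() - start) * 1000.0
        return result

    raw = timed("fetch", fetch_features, user_id)
    vector = timed("transform", transform, raw)
    scores = timed("infer", infer, vector, item_ids)
    top = timed("post_process", post_process, scores, k, blocked)
    within_budget = sum(timings.values()) <= budget_ms
    return top, timings, within_budget


top, timings, within_budget = serve("user-1", ["a", "b", "c", "d"], blocked={"a"})
```

Keeping per-stage timings separate is what lets you see which step is eating the budget when p99 drifts.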

Load scales linearly with traffic: 10k QPS means 10k inferences per second. At that rate, every millisecond per inference matters, which is why high-performance model servers (TensorFlow Serving, NVIDIA Triton, TorchServe) exist.
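A rough way to see why each millisecond matters is Little's law: the average number of requests in flight equals arrival rate times latency. The sketch below turns that into a replica count. The function name and the assumption that each replica handles a fixed number of concurrent inferences are illustrative simplifications; it ignores headroom, bursts, and batching.

```python
import math


def replicas_needed(qps, latency_ms, concurrency_per_replica):
    """Estimate replicas from average in-flight requests (Little's law)."""
    in_flight = qps * latency_ms / 1000.0  # requests being served at any instant
    return math.ceil(in_flight / concurrency_per_replica)


# 10k QPS at 20 ms per inference, 8 concurrent inferences per replica (assumed).
at_20ms = replicas_needed(10_000, 20, 8)   # 200 in flight -> 25 replicas
at_19ms = replicas_needed(10_000, 19, 8)   # 190 in flight -> 24 replicas
```

Under these assumptions, shaving 1 ms off inference removes 10 in-flight requests, which here is a whole replica.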

< 50 ms: p99 inference budget
~10 ms: small model on CPU
~30 ms: deep model on GPU
~5 ms: feature fetch (typical)

04

When batch beats online

Online sounds better because predictions are fresher. It is also often 10-100× more expensive per prediction, since the model runs inside every request instead of once per refresh. Use batch when:

  • Predictions are stable over hours. "Top 100 movies for this user" probably doesn't change in an hour. Compute nightly, serve from a KV store, save 99% of inference cost.
  • The product surface tolerates staleness. Email digest, "for you" widget on the homepage. Fine if it was computed 8 hours ago.
  • The candidate set is small + known ahead of time. Score every user × every item once; serve top-K from cache.
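The batch pattern in the bullets above can be sketched in a few lines: a nightly job scores every user against every item, stores only the top-K, and request time becomes a lookup. This is a toy version under stated assumptions: the dot-product scorer stands in for a real model, a plain dict stands in for the KV store, and the names `nightly_job` and `serve_recs` are hypothetical.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nightly_job(user_vecs, item_vecs, k):
    """Score every user x every item once; keep only the top-k item ids per user."""
    cache = {}
    for user_id, u in user_vecs.items():
        scored = ((dot(u, v), item_id) for item_id, v in item_vecs.items())
        cache[user_id] = [item_id for _, item_id in heapq.nlargest(k, scored)]
    return cache


def serve_recs(cache, user_id, fallback):
    """Request-time path: a dictionary lookup, no model call."""
    return cache.get(user_id, fallback)


user_vecs = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
item_vecs = {"action": [0.9, 0.1], "drama": [0.2, 0.8], "comedy": [0.5, 0.5]}

cache = nightly_job(user_vecs, item_vecs, k=2)
alice_recs = serve_recs(cache, "alice", fallback=["comedy"])
new_user_recs = serve_recs(cache, "carol", fallback=["comedy"])
```

Note the fallback: users who appear after the last job ran have no cached entry, which is one of the staleness costs the product surface has to tolerate.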

Use online when:

  • Context matters per request (current session, query, location).
  • The candidate set is huge and changes (search ranking over millions of products).
  • Latency-to-prediction matters (autocomplete suggesting words as you type).
05

Deep dive — the candidate generation + ranking pattern

Modern recommendation + search systems split serving into two stages — and use different modes for each:

Stage 1 — Candidate generation. From millions of items, narrow to ~hundreds. Often batch precomputed — a daily job runs ANN search over embeddings to build "top 200 similar items per user." Cached.

Stage 2 — Ranking. The 200 candidates are scored by a richer model online, with full real-time context. Output: ordered top 10 to show.

This split is everywhere — YouTube recommendation, Twitter feed ranking, Amazon search. Stage 1 takes care of "find relevant stuff" cheaply (batch). Stage 2 takes care of "rank what's relevant given right now" expensively (online).

Net cost: full ML model runs on hundreds of items per request, not millions. Latency stays in the 100ms budget. Personalization stays high because stage 2 is fully online with current context.
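A minimal sketch of the two-stage split follows. It is not how any of the named systems implement it: stage 1 here uses an exact dot-product sort as a stand-in for ANN search, the "richer model" in stage 2 is just the same similarity plus a session-based penalty, and all names (`generate_candidates`, `rank_online`, `seen_this_session`) are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec, item_vecs, n):
    """Stage 1 (batch): narrow the catalog to n items.
    Exact search here; a real job would use an ANN index."""
    ranked = sorted(item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True)
    return ranked[:n]


def rank_online(candidates, user_vec, item_vecs, seen_this_session, k):
    """Stage 2 (online): rescore only the candidates using request-time context."""
    def score(item_id):
        base = dot(user_vec, item_vecs[item_id])
        # Illustrative context feature: push down items already seen this session.
        penalty = 0.5 if item_id in seen_this_session else 0.0
        return base - penalty

    return sorted(candidates, key=score, reverse=True)[:k]


item_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [0.6, 0.4], "d": [0.0, 1.0]}
user_vec = [1.0, 0.0]

# Precomputed offline and cached per user.
candidates = generate_candidates(user_vec, item_vecs, n=3)

# Computed per request with fresh session context.
shown = rank_online(candidates, user_vec, item_vecs, seen_this_session={"a"}, k=2)
```

The online stage touches only `len(candidates)` items regardless of catalog size, which is where the cost saving comes from, while the session signal still changes the final order.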

Interview answer

"Two-stage serving: batch candidate generation (precomputed daily, ANN over embeddings) narrows to the top 200, then an online ranking model scores those with real-time features. That cuts online inference cost by orders of magnitude (roughly 1000× or more for a catalog in the millions) versus ranking the full catalog online, while keeping the final ranking fresh."

06

Real-world serving infra

TensorFlow Serving / TorchServe

Open-source model servers

HTTP / gRPC interface, batching, model versioning. The default for self-hosted online serving.

NVIDIA Triton

GPU-optimized

Multi-framework (TensorFlow, PyTorch, ONNX), runs on GPU and CPU. Built for high throughput per GPU through features such as dynamic batching and concurrent model execution.

SageMaker / Vertex AI / Azure ML

Managed

Cloud-managed serving. Auto-scaling, blue-green deploys, and A/B testing are built in. Pricier, but far less operational work.

Spark batch + Redis cache

The 80% solution

Most production "ML" systems aren't online inference — they're nightly Spark jobs writing to Redis.

07

Used in problems

Typeahead does pure online inference (on every keystroke). The recommendation algorithm uses two-stage serving (batch candidates plus online ranking). The news feed scores its top-200 candidates online on each load.
