System Design — 16

Recommendation Algorithm

Design the engine behind "Recommended for you" — a system that surfaces personalised content or products for hundreds of millions of users across a catalogue of millions of items, in real time, while continuously learning from new signals.

Machine LearningTwo-Stage FunnelEmbeddingsLambda ArchitectureANN Search
01

Problem Statement

Design a recommendation system that surfaces personalised content or products to users based on their behaviour, preferences, and context. Think of it as the engine behind Netflix's rows, Amazon's "Customers also bought," or YouTube's "Up next."

The system must work for both media (movies, music, videos) and e-commerce (products) — the core architecture is identical. Where they diverge (signal types, catalogue size, engagement metrics), we'll call it out.

Core question: How do you generate relevant, personalised recommendations for hundreds of millions of users across a catalogue of millions of items, in under 200 ms, while continuously learning from new signals?

The reason this is hard isn't any single piece — it's that you need to solve three conflicting problems simultaneously: relevance (show good stuff), freshness (adapt quickly to new behaviour), and latency (do it in under 200 ms for every page load).

02

Requirements

Functional Requirements

  • Personalised recommendations per user — given a user ID, return an ordered list of items ranked by predicted relevance (homepage feed)
  • Similar item recommendations — given an item ID, return related items ("Because you watched X," "Customers also bought")
  • Real-time signal ingestion — incorporate recent user actions (click, purchase, skip) within minutes, not next-day batch
  • Multiple recommendation contexts — homepage (exploratory), item page (narrow), cart page (complementary), email/push
  • Cold-start handling — new users get popularity/demographic-based recs; new items get exposure via content features + exploration

Non-Functional Requirements

  • Low latency — P99 under 200 ms for serving recommendations
  • High availability99.99% uptime; fallback to trending/popular, never an empty page
  • Scalability500 M+ users, 50 M+ item catalogue, billions of events/day
  • Signal freshness — new user action reflected in recs within minutes; new catalogue item within hours
  • Diversity & exploration — prevent filter bubbles; explicit diversity injection across categories
  • Explainability — "Because you watched X" — users need a reason; constrains algorithm choice

Requirements Tension

TensionWhat pullsResolution
Latency vs. RelevanceBest model is a deep net with hundreds of features, but inference on 50 M items is too slowTwo-stage funnel — cheap candidate generation, expensive ranking on 1000 candidates
Freshness vs. CostReal-time recomputation on every click is expensive; batch is cheap but staleLambda architecture — batch for base embeddings, stream for recent signals, merge at serving
Relevance vs. DiversityOptimising purely for CTR creates filter bubblesRe-ranking stage enforces diversity rules post-scoring (MMR)
Explainability vs. PowerDeep learning outperforms but is a black box; CF is explainable but weakerHybrid — deep model for ranking, explainability metadata from candidate generation source
Cold Start vs. PersonalisationNew users/items have no signalContent-based features bootstrap until behavioural signal accumulates
03

Scale Estimation

Every number is derived from an explicit assumption. The numbers drive the architecture — not the other way around.

~20 K
Peak RPS (serving)
~100 K
Peak events/sec (ingest)
1 TB
User embeddings (RAM)
~80 GB
Precomputed rec cache
50 M
Items in catalogue
200 TB/yr
Raw event storage

Derivation

100 M DAU × 5 rec requests/day = 500 M requests/day ÷ 86 400 ≈ 5 800 RPS avg, ~20 K peak. 100 M DAU × 30 events/day = 3 B events/day ≈ 35 K eps avg, ~100 K peak. 500 M users × 2 KB (features + embedding) = 1 TB user representations — must fit in memory. 50 M items × 128-d × 4 bytes ≈ 25 GB item embeddings — fits on one machine, replicated. Precomputed top-100 recs for 100 M active users × 800 bytes = 80 GB Redis cache. Batch pipeline (training + embedding export + index rebuild) takes ~12 h end-to-end — this is why you need the stream layer.

NumberArchitectural consequence
20 K RPSPrecomputation + caching — can't run full model inference on every request
100 K events/secKafka + stream processing; batch ETL can't keep up for real-time features
50 M itemsANN index required — brute-force dot product is too slow online
1 TB embeddingsDistributed in-memory store (Redis cluster) — can't hit disk at 20 K RPS
12 h batch pipelineLambda architecture is unavoidable — batch alone is too stale
04

API Design

GET /v1/recommendations — personalised feed
GET /v1/recommendations?user_id=u_abc&context=homepage&page_size=50&exclude=item_1,item_2

Response 200:
{
  "request_id": "req_7f3a...",
  "items": [{
    "item_id": "item_882", "rank": 1, "score": 0.94,
    "explanation": { "type": "behavioral", "reason": "Because you watched Inception", "source_item_id": "item_331" },
    "metadata": { "title": "Interstellar", "image_url": "..." },
    "rec_source": "collaborative_filtering"
  }, ...],
  "model_version": "v2.4.1",
  "served_from": "cache_with_rerank"
}

context drives routing — homepage triggers broad exploration, cart page triggers complementary retrieval. request_id closes the feedback loop: rec → impression → click → training data. rec_source enables A/B testing per algorithm.

GET /v1/items/{id}/similar — item-to-item
GET /v1/items/item_882/similar?user_id=u_abc&limit=20&diversity=0.3

Response 200:
{
  "source_item": "item_882",
  "items": [{
    "item_id": "item_990", "similarity_score": 0.91,
    "similarity_type": "content",
    "shared_attributes": ["sci-fi", "director:Nolan"]
  }, ...]
}
POST /v1/events — record user action (async)
POST /v1/events
{
  "user_id": "u_abc", "event_type": "click", "item_id": "item_882",
  "context": { "page": "homepage", "position": 3, "request_id": "req_7f3a...", "session_id": "sess_x9f2..." },
  "event_value": { "watch_percent": 0.85 }
}

Response 202 Accepted  // fire-and-forget → Kafka

202 Accepted, not 200 — events go to Kafka asynchronously. context.position enables position bias correction in training. context.request_id joins the impression log to the event log, closing the feedback loop.

05

High-Level Architecture

Three lanes operating at three time scales, connected by a feedback loop. Batch (hours) produces quality. Stream (seconds) produces freshness. Serving (milliseconds) produces speed.

BATCH LANE STREAM LANE SERVING LANE Data Lake S3 / HDFS Model Training GPU Cluster Embeddings User + Item ANN Index FAISS / Milvus Rec Cache Redis · 80 GB Event Collector Stateless Kafka 100 K eps Flink Stream Proc Feature Store Online · Redis batch feed Client Web / Mobile API Gateway Auth + Route Serving Service Cache → ANN → Rank → Re-rank Ranking Model Deep Cross Net Feedback Loop Log → Train training data events
Request Flow — Step Through
ClientAPI GatewayRec CacheANN IndexFeature StoreRanking ModelRe-RankerResponse
Click Next Step to walk through the request flow.
06

Deep Dive — Candidate Generation Strategies

Candidate generation is where recommendation systems are won or lost. The ranking model can only score what the candidate generator retrieves. Four approaches, in increasing sophistication:

1. Item-Based Collaborative Filtering

Compute item-item similarity from co-occurrence in the interaction matrix. Simple, fast (Redis lookup), and naturally explainable: "Customers who bought X also bought Y." But suffers from popularity bias (popular items are similar to everything) and sparsity (99.99% of item pairs have zero co-occurrence at 50 M items).

2. Content-Based Filtering

Match item attributes (genre, brand, price) to user preference vector. Solves cold start for new items (metadata exists on day one). But over-specialises — no serendipity — and requires clean metadata.

3. Matrix Factorisation

Decompose the sparse user-item matrix into dense user and item embeddings in a shared latent space. Score = dot product. Discovers latent patterns that co-occurrence misses. The bridge to modern methods — but can't incorporate side features.

4. Two-Tower Neural Model (Production Standard)

Two separate neural networks produce user and item embeddings. User tower ingests demographics, session context, click history. Item tower ingests category, price, text embeddings. Score = dot product. The towers are decoupled — item embeddings are precomputed and indexed for ANN search. One user tower forward pass (~2 ms) + one ANN lookup (~5 ms) = candidate retrieval from 50 M items in 7 ms.

Key insight: Decoupling the towers is what makes the system serve at 20 K RPS. You do one neural net forward pass plus one ANN lookup instead of 50 M forward passes.

sequenceDiagram participant C as Client participant S as Serving Service participant Cache as Rec Cache participant ANN as ANN Index participant CF as Item-CF Tables participant FS as Feature Store participant R as Ranking Model C->>S: GET /recommendations S->>Cache: Check precomputed recs Cache-->>S: Cache HIT (70%) → return Note over S: Cache MISS or stale ↓ S->>ANN: User embedding → top 500 S->>CF: Recent items → top 200 Note over S: + Content-based 100 + Trending 100 S->>S: Merge & dedup → ~1000 candidates S->>FS: Fetch features for 1000 items FS-->>S: User + item + cross features S->>R: Score 1000 candidates R-->>S: Ranked scores S->>S: Re-rank (diversity, rules, explore) S-->>C: Top 50 with explanations

ANN Index Internals (FAISS IVF+PQ)

Brute-force search across 50 M items: ~130 ms on CPU. Too slow. IVF clusters items into ~4096 cells via K-means, then searches only the nearest nprobe=64 clusters (~780 K items). PQ compresses each 128-d vector from 512 bytes to 32 bytes. Combined: ~3-5 ms for top-1000 at ~95% recall, using ~2 GB RAM.

nprobeItems scannedLatencyRecall@1000
8~98 K0.3 ms~75%
32~390 K1 ms~90%
64~780 K2 ms~95%
128~1.5 M4 ms~98%

Ranking Model (Stage 2)

A Deep & Cross Network scores the 1000 candidates with rich cross-features: user × item interactions (price vs. user average, time since last category purchase, position bias correction). Multi-objective: P(click), P(purchase), P(return). Scoring 1000 items: ~20-30 ms on CPU.

Production Systems Combine All Four

Multiple generators run in parallel — two-tower ANN (500 candidates), item-based CF (200), content-based (100), trending (100), personal history (100) — merge, dedup, and send ~1000 candidates to the ranker. Each generator covers the others' blind spots.

07

Key Design Decisions & Tradeoffs

Two-Stage Funnel vs. Single Full-Catalogue Ranking

✓ Chosen

Two-Stage (Retrieve + Rank)

50 M → 1000 via cheap ANN (5 ms), then 1000 → 50 via rich ranking model (30 ms). Total ~35 ms. Recall loss mitigated by multiple diverse generators.

✗ Alternative

Single Full-Catalogue Ranker

50 M × 0.03 ms = 25 minutes per request. Mathematically impossible at 200 ms P99. Only viable for catalogues under ~10 K items.

Lambda (Batch + Stream) vs. Batch-Only

✓ Chosen

Lambda Architecture

Batch for stable model weights and embeddings (daily). Stream for real-time user features and purchase suppression (seconds). Cost: operational complexity and training-serving skew risk.

✗ Alternative

Batch-Only

30% less engineering effort. But recs are 12-24 h stale — user buys a washing machine, still sees washing machine recommendations. Visible failure.

Two-Tower vs. Feature Cross Model for Candidate Gen

✓ Chosen

Two-Tower (Decoupled)

Item embeddings precomputed, enabling ANN search of 50 M items in 5 ms. Cost: only linear interactions (dot product). Expressiveness recovered at ranking stage.

✗ Alternative

Feature Cross Model

Captures arbitrary feature interactions (user location × item material). But can't precompute — must score each item individually. Only feasible for 1000 candidates, not 50 M.

Daily Batch Retraining vs. Online Learning

✓ Chosen

Daily Batch + Real-Time Features

Model weights frozen daily for stability and A/B testability. Freshness from streaming layer updating user features — model inputs change, outputs adapt. Reproducible, debuggable.

✗ Alternative

Full Online Learning

Noisy gradients from single examples cause oscillation. Catastrophic forgetting of inactive users. Non-reproducible. Can't A/B test a moving target.

Precomputed Cache: Full vs. Hybrid

✓ Chosen

Hybrid (Power Users + On-Demand)

Precompute for top 40 M active users. Bottom 60 M (10% of requests) computed on-demand and cached for 48 h. 60% less batch compute.

✗ Alternative

Full Precompute (100 M Users)

60 M cache entries go stale before anyone reads them. Wasted batch compute for users who visit once a week.

08

What Can Go Wrong

The most dangerous failures are silent — HTTP 200, no errors, but recommendations are garbage for weeks.

Training-Serving Skew

Features computed differently in batch vs. online store. Model sees inputs it never trained on. No errors, just gradual CTR decline. Detect: feature distribution monitoring, shadow scoring. Prevent: shared feature store with single feature definition.

Feedback Loop Collapse (Echo Chamber)

System converges to recommending the same ~500 popular items. Catalogue diversity collapses. Takes weeks to detect. Detect: catalogue coverage metric, recommendation entropy, Gini coefficient of impressions. Prevent: exploration budget (Thompson sampling), popularity normalisation, calibrated re-ranking.

Position Bias in Training Data

Items at position #1 get 10-30× more clicks regardless of relevance. Model learns "items shown first are good" — self-fulfilling prophecy. Prevent: inverse propensity weighting, position-as-a-feature in ranking model, periodic randomisation.

ANN Index Staleness or Corruption

Index rebuild fails silently; serving uses stale embeddings. Or a shard gets corrupted, causing missing candidates for certain categories. Prevent: blue-green index deployment, canary queries, embedding version assertions.

Event Pipeline Lag

Flink falls behind during traffic spikes. Real-time features become stale — purchased items not suppressed, session context outdated. Detect: Kafka consumer lag monitoring. Prevent: Flink autoscaling, separate fast path for critical features (purchase suppression).

Model Quality Regression

New model passes offline metrics but degrades online (corrupted training data, distribution shift, hyperparameter drift). Prevent: offline validation gates, canary deployment to 1% traffic, champion-challenger framework, automated rollback.

Feature Store Outage

Redis cluster down = ranking model scores on zeros. Highest-impact infra failure. Prevent: tiered fallback (precomputed cache → local in-memory cache → popularity-based), cross-region replication, circuit breaker on feature fetching.

Adversarial Manipulation (Shilling Attacks)

Fake accounts with coordinated interactions boost target products. Prevent: reputation-weighted CF (new accounts have 0.01× weight), interaction velocity anomaly detection, co-occurrence anomaly monitoring.

09

Interview Tips

💡
Start with the feedback loop, not the model.
Open with: "The core challenge isn't picking an algorithm — it's building a system where recommendations generate interaction data, that data improves the model, and the improved model generates better recommendations." This signals systems thinking in the first 30 seconds.
📐
Numbers drive design.
Never say "we use Kafka because it's industry standard." Say "100 M DAU × 30 events/day = 35 K eps — this is a streaming problem; batch ETL can't process 100 K eps with sub-minute latency." Every technology choice should be preceded by a number that makes it obvious.
The two-stage funnel is the single most important concept.
Explain with latency math: "50 M × 0.03 ms = 25 minutes. So we split it: ANN retrieval (5 ms) narrows to 1000, ranking (30 ms) selects 50. Cost is recall loss — mitigated by multiple generators."
🎯
Draw three lanes, not just boxes.
Batch (hours) → quality. Stream (seconds) → freshness. Serving (ms) → speed. Label each lane with its time scale. This communicates more architecture understanding in 30 seconds than 5 minutes of talking about components.
🛡️
Proactively raise failure modes.
After the happy path, say: "The most dangerous failures in rec systems are silent — HTTP 200 but garbage recs for weeks." Then cover training-serving skew and feedback loop collapse. Most candidates only describe the happy path.
📉
Show you can scale down.
"At 10 K users and 1 K items? SQL co-occurrence query, single PostgreSQL, no ML, no Kafka. The numbers don't justify distributed complexity." Walking the evolution ladder down shows you reason from constraints, not pattern-match.
11

Evolution

Each stage is triggered by a specific bottleneck. Add complexity only when the numbers demand it.

1

MVP — SQL Co-Occurrence (10 K users, 1 K items)

Single PostgreSQL. Co-purchase query via JOIN, category-level popularity fallback. No ML, no Redis. Focus on getting the event schema right — context.position and request_id from day one. Advance when query exceeds 200 ms or 1 M events collected.

2

First ML — Matrix Factorisation (500 K users, 100 K items)

ALS in a nightly cron job, embeddings in Redis, brute-force scoring of full catalogue (0.1 ms at 100 K items). Cold-start fallback to category popularity. Discover staleness problem — add purchase suppression filter. Advance when catalogue exceeds 100 K or events exceed 5 K/sec.

3

Production ML — Two-Tower + ANN (10 M users, 1 M items)

Introduce Kafka for event ingestion, FAISS IVFPQ index, two-tower model on GPU, simple stream consumer for session features. First version of the full serving pipeline. Advance when feature engineering bottlenecks emerge or scale exceeds 100 M users.

4

Full Platform (100 M+ users, 50 M items)

Feature store (Feast/Tecton), Flink for stream processing, multi-generator candidate pipeline, deep ranking model with cross features, experimentation platform with canary deployment. Organisation splits into specialised teams: ML platform, rec science, ranking, online systems, data engineering.

5

Planet Scale (500 M+ users, multi-region)

Multi-region deployment (EU data stays in EU for GDPR). Cross-surface recommendations (mobile, web, TV, voice) — one user profile, surface-aware ranking. Distributed GPU training with model parallelism. Mixture of Experts, sequential transformers (BERT4Rec), reinforcement learning for long-term retention, LLM augmentation for cold-start and explanations.

Next up