Recommendation Algorithm

01

Problem Statement

Design a recommendation system that surfaces personalised content or products to users based on their behaviour, preferences, and context. Think of it as the engine behind Netflix's rows, Amazon's "Customers also bought," or YouTube's "Up next."

The system must work for both media (movies, music, videos) and e-commerce (products) — the core architecture is identical. Where they diverge (signal types, catalogue size, engagement metrics), we'll call it out.

Core question: How do you generate relevant, personalised recommendations for hundreds of millions of users across a catalogue of millions of items, in under 200 ms, while continuously learning from new signals?

The reason this is hard isn't any single piece — it's that you need to solve three conflicting problems simultaneously: relevance (show good stuff), freshness (adapt quickly to new behaviour), and latency (do it in under 200 ms for every page load).

02

Requirements

Functional Requirements

Personalised recommendations per user — given a user ID, return an ordered list of items ranked by predicted relevance (homepage feed)
Similar item recommendations — given an item ID, return related items ("Because you watched X," "Customers also bought")
Real-time signal ingestion — incorporate recent user actions (click, purchase, skip) within minutes, not next-day batch
Multiple recommendation contexts — homepage (exploratory), item page (narrow), cart page (complementary), email/push
Cold-start handling — new users get popularity/demographic-based recs; new items get exposure via content features + exploration

Non-Functional Requirements

Low latency — P99 under 200 ms for serving recommendations
High availability — 99.99% uptime; fallback to trending/popular, never an empty page
Scalability — 500 M+ users, 50 M+ item catalogue, billions of events/day
Signal freshness — new user action reflected in recs within minutes; new catalogue item within hours
Diversity & exploration — prevent filter bubbles; explicit diversity injection across categories
Explainability — "Because you watched X" — users need a reason; constrains algorithm choice

Requirements Tension

Tension	What pulls	Resolution
Latency vs. Relevance	Best model is a deep net with hundreds of features, but inference on 50 M items is too slow	Two-stage funnel — cheap candidate generation, expensive ranking on 1000 candidates
Freshness vs. Cost	Real-time recomputation on every click is expensive; batch is cheap but stale	Lambda architecture — batch for base embeddings, stream for recent signals, merge at serving
Relevance vs. Diversity	Optimising purely for CTR creates filter bubbles	Re-ranking stage enforces diversity rules post-scoring (MMR)
Explainability vs. Power	Deep learning outperforms but is a black box; CF is explainable but weaker	Hybrid — deep model for ranking, explainability metadata from candidate generation source
Cold Start vs. Personalisation	New users/items have no signal	Content-based features bootstrap until behavioural signal accumulates

03

Scale Estimation

Every number is derived from an explicit assumption. The numbers drive the architecture — not the other way around.

~20 K

Peak RPS (serving)

~100 K

Peak events/sec (ingest)

1 TB

User embeddings (RAM)

~80 GB

Precomputed rec cache

50 M

Items in catalogue

200 TB/yr

Raw event storage

Derivation

100 M DAU × 5 rec requests/day = 500 M requests/day ÷ 86 400 ≈ 5 800 RPS avg, ~20 K peak. 100 M DAU × 30 events/day = 3 B events/day ≈ 35 K eps avg, ~100 K peak. 500 M users × 2 KB (features + embedding) = 1 TB user representations — must fit in memory. 50 M items × 128-d × 4 bytes ≈ 25 GB item embeddings — fits on one machine, replicated. Precomputed top-100 recs for 100 M active users × 800 bytes = 80 GB Redis cache. Batch pipeline (training + embedding export + index rebuild) takes ~12 h end-to-end — this is why you need the stream layer.

Number	Architectural consequence
20 K RPS	Precomputation + caching — can't run full model inference on every request
100 K events/sec	Kafka + stream processing; batch ETL can't keep up for real-time features
50 M items	ANN index required — brute-force dot product is too slow online
1 TB embeddings	Distributed in-memory store (Redis cluster) — can't hit disk at 20 K RPS
12 h batch pipeline	Lambda architecture is unavoidable — batch alone is too stale

04

API Design

GET /v1/recommendations — personalised feed

GET /v1/recommendations?user_id=u_abc&context=homepage&page_size=50&exclude=item_1,item_2

Response 200:
{
  "request_id": "req_7f3a...",
  "items": [{
    "item_id": "item_882", "rank": 1, "score": 0.94,
    "explanation": { "type": "behavioral", "reason": "Because you watched Inception", "source_item_id": "item_331" },
    "metadata": { "title": "Interstellar", "image_url": "..." },
    "rec_source": "collaborative_filtering"
  }, ...],
  "model_version": "v2.4.1",
  "served_from": "cache_with_rerank"
}

context drives routing — homepage triggers broad exploration, cart page triggers complementary retrieval. request_id closes the feedback loop: rec → impression → click → training data. rec_source enables A/B testing per algorithm.

GET /v1/items/{id}/similar — item-to-item

GET /v1/items/item_882/similar?user_id=u_abc&limit=20&diversity=0.3

Response 200:
{
  "source_item": "item_882",
  "items": [{
    "item_id": "item_990", "similarity_score": 0.91,
    "similarity_type": "content",
    "shared_attributes": ["sci-fi", "director:Nolan"]
  }, ...]
}

POST /v1/events — record user action (async)

POST /v1/events
{
  "user_id": "u_abc", "event_type": "click", "item_id": "item_882",
  "context": { "page": "homepage", "position": 3, "request_id": "req_7f3a...", "session_id": "sess_x9f2..." },
  "event_value": { "watch_percent": 0.85 }
}

Response 202 Accepted  // fire-and-forget → Kafka

202 Accepted, not 200 — events go to Kafka asynchronously. context.position enables position bias correction in training. context.request_id joins the impression log to the event log, closing the feedback loop.

05

High-Level Architecture

Three lanes operating at three time scales, connected by a feedback loop. Batch (hours) produces quality. Stream (seconds) produces freshness. Serving (milliseconds) produces speed.

Request Flow — Step Through

Client→API Gateway→Rec Cache→ANN Index→Feature Store→Ranking Model→Re-Ranker→Response

Click Next Step to walk through the request flow.

06

Deep Dive — Candidate Generation Strategies

Candidate generation is where recommendation systems are won or lost. The ranking model can only score what the candidate generator retrieves. Four approaches, in increasing sophistication:

1. Item-Based Collaborative Filtering

Compute item-item similarity from co-occurrence in the interaction matrix. Simple, fast (Redis lookup), and naturally explainable: "Customers who bought X also bought Y." But suffers from popularity bias (popular items are similar to everything) and sparsity (99.99% of item pairs have zero co-occurrence at 50 M items).

2. Content-Based Filtering

Match item attributes (genre, brand, price) to user preference vector. Solves cold start for new items (metadata exists on day one). But over-specialises — no serendipity — and requires clean metadata.

3. Matrix Factorisation

Decompose the sparse user-item matrix into dense user and item embeddings in a shared latent space. Score = dot product. Discovers latent patterns that co-occurrence misses. The bridge to modern methods — but can't incorporate side features.

4. Two-Tower Neural Model (Production Standard)

Two separate neural networks produce user and item embeddings. User tower ingests demographics, session context, click history. Item tower ingests category, price, text embeddings. Score = dot product. The towers are decoupled — item embeddings are precomputed and indexed for ANN search. One user tower forward pass (~2 ms) + one ANN lookup (~5 ms) = candidate retrieval from 50 M items in 7 ms.

Key insight: Decoupling the towers is what makes the system serve at 20 K RPS. You do one neural net forward pass plus one ANN lookup instead of 50 M forward passes.

sequenceDiagram participant C as Client participant S as Serving Service participant Cache as Rec Cache participant ANN as ANN Index participant CF as Item-CF Tables participant FS as Feature Store participant R as Ranking Model C->>S: GET /recommendations S->>Cache: Check precomputed recs Cache-->>S: Cache HIT (70%) → return Note over S: Cache MISS or stale ↓ S->>ANN: User embedding → top 500 S->>CF: Recent items → top 200 Note over S: + Content-based 100 + Trending 100 S->>S: Merge & dedup → ~1000 candidates S->>FS: Fetch features for 1000 items FS-->>S: User + item + cross features S->>R: Score 1000 candidates R-->>S: Ranked scores S->>S: Re-rank (diversity, rules, explore) S-->>C: Top 50 with explanations

ANN Index Internals (FAISS IVF+PQ)

Brute-force search across 50 M items: ~130 ms on CPU. Too slow. IVF clusters items into ~4096 cells via K-means, then searches only the nearest nprobe=64 clusters (~780 K items). PQ compresses each 128-d vector from 512 bytes to 32 bytes. Combined: ~3-5 ms for top-1000 at ~95% recall, using ~2 GB RAM.

nprobe	Items scanned	Latency	Recall@1000
8	~98 K	0.3 ms	~75%
32	~390 K	1 ms	~90%
64	~780 K	2 ms	~95%
128	~1.5 M	4 ms	~98%

Ranking Model (Stage 2)

A Deep & Cross Network scores the 1000 candidates with rich cross-features: user × item interactions (price vs. user average, time since last category purchase, position bias correction). Multi-objective: P(click), P(purchase), P(return). Scoring 1000 items: ~20-30 ms on CPU.

Production Systems Combine All Four

Multiple generators run in parallel — two-tower ANN (500 candidates), item-based CF (200), content-based (100), trending (100), personal history (100) — merge, dedup, and send ~1000 candidates to the ranker. Each generator covers the others' blind spots.

07

60 M cache entries go stale before anyone reads them. Wasted batch compute for users who visit once a week.

08

What Can Go Wrong

The most dangerous failures are silent — HTTP 200, no errors, but recommendations are garbage for weeks.

Training-Serving Skew

Features computed differently in batch vs. online store. Model sees inputs it never trained on. No errors, just gradual CTR decline. Detect: feature distribution monitoring, shadow scoring. Prevent: shared feature store with single feature definition.

Feedback Loop Collapse (Echo Chamber)

System converges to recommending the same ~500 popular items. Catalogue diversity collapses. Takes weeks to detect. Detect: catalogue coverage metric, recommendation entropy, Gini coefficient of impressions. Prevent: exploration budget (Thompson sampling), popularity normalisation, calibrated re-ranking.

Position Bias in Training Data

Items at position #1 get 10-30× more clicks regardless of relevance. Model learns "items shown first are good" — self-fulfilling prophecy. Prevent: inverse propensity weighting, position-as-a-feature in ranking model, periodic randomisation.

ANN Index Staleness or Corruption

Index rebuild fails silently; serving uses stale embeddings. Or a shard gets corrupted, causing missing candidates for certain categories. Prevent: blue-green index deployment, canary queries, embedding version assertions.

Event Pipeline Lag

Flink falls behind during traffic spikes. Real-time features become stale — purchased items not suppressed, session context outdated. Detect: Kafka consumer lag monitoring. Prevent: Flink autoscaling, separate fast path for critical features (purchase suppression).

Model Quality Regression

New model passes offline metrics but degrades online (corrupted training data, distribution shift, hyperparameter drift). Prevent: offline validation gates, canary deployment to 1% traffic, champion-challenger framework, automated rollback.

Feature Store Outage

Redis cluster down = ranking model scores on zeros. Highest-impact infra failure. Prevent: tiered fallback (precomputed cache → local in-memory cache → popularity-based), cross-region replication, circuit breaker on feature fetching.

Adversarial Manipulation (Shilling Attacks)

Fake accounts with coordinated interactions boost target products. Prevent: reputation-weighted CF (new accounts have 0.01× weight), interaction velocity anomaly detection, co-occurrence anomaly monitoring.

09

Interview Tips

💡

Start with the feedback loop, not the model.
Open with: "The core challenge isn't picking an algorithm — it's building a system where recommendations generate interaction data, that data improves the model, and the improved model generates better recommendations." This signals systems thinking in the first 30 seconds.

📐

Numbers drive design.
Never say "we use Kafka because it's industry standard." Say "100 M DAU × 30 events/day = 35 K eps — this is a streaming problem; batch ETL can't process 100 K eps with sub-minute latency." Every technology choice should be preceded by a number that makes it obvious.

⚡

The two-stage funnel is the single most important concept.
Explain with latency math: "50 M × 0.03 ms = 25 minutes. So we split it: ANN retrieval (5 ms) narrows to 1000, ranking (30 ms) selects 50. Cost is recall loss — mitigated by multiple generators."

🎯

Draw three lanes, not just boxes.
Batch (hours) → quality. Stream (seconds) → freshness. Serving (ms) → speed. Label each lane with its time scale. This communicates more architecture understanding in 30 seconds than 5 minutes of talking about components.

🛡️

Proactively raise failure modes.
After the happy path, say: "The most dangerous failures in rec systems are silent — HTTP 200 but garbage recs for weeks." Then cover training-serving skew and feedback loop collapse. Most candidates only describe the happy path.

📉

Show you can scale down.
"At 10 K users and 1 K items? SQL co-occurrence query, single PostgreSQL, no ML, no Kafka. The numbers don't justify distributed complexity." Walking the evolution ladder down shows you reason from constraints, not pattern-match.

10

Evolution

Each stage is triggered by a specific bottleneck. Add complexity only when the numbers demand it.

1

MVP — SQL Co-Occurrence (10 K users, 1 K items)

Single PostgreSQL. Co-purchase query via JOIN, category-level popularity fallback. No ML, no Redis. Focus on getting the event schema right — context.position and request_id from day one. Advance when query exceeds 200 ms or 1 M events collected.

2

First ML — Matrix Factorisation (500 K users, 100 K items)

ALS in a nightly cron job, embeddings in Redis, brute-force scoring of full catalogue (0.1 ms at 100 K items). Cold-start fallback to category popularity. Discover staleness problem — add purchase suppression filter. Advance when catalogue exceeds 100 K or events exceed 5 K/sec.

3

Production ML — Two-Tower + ANN (10 M users, 1 M items)

Introduce Kafka for event ingestion, FAISS IVFPQ index, two-tower model on GPU, simple stream consumer for session features. First version of the full serving pipeline. Advance when feature engineering bottlenecks emerge or scale exceeds 100 M users.

4

Full Platform (100 M+ users, 50 M items)

Feature store (Feast/Tecton), Flink for stream processing, multi-generator candidate pipeline, deep ranking model with cross features, experimentation platform with canary deployment. Organisation splits into specialised teams: ML platform, rec science, ranking, online systems, data engineering.

5

Planet Scale (500 M+ users, multi-region)

Multi-region deployment (EU data stays in EU for GDPR). Cross-surface recommendations (mobile, web, TV, voice) — one user profile, surface-aware ranking. Distributed GPU training with model parallelism. Mixture of Experts, sequential transformers (BERT4Rec), reinforcement learning for long-term retention, LLM augmentation for cold-start and explanations.

📺

References & Videos

Recommendation System Design

ByteByteGo · 10 min

How Netflix Recommendation System Works

Gaurav Sen · 18 min

System Architectures for Personalization

Netflix Tech Blog

Design a Recommendation System

AlgoMaster

Problem Statement

Requirements

Functional Requirements

Non-Functional Requirements

Requirements Tension

Scale Estimation

Derivation

API Design

High-Level Architecture

Deep Dive — Candidate Generation Strategies

1. Item-Based Collaborative Filtering

2. Content-Based Filtering

3. Matrix Factorisation

4. Two-Tower Neural Model (Production Standard)

ANN Index Internals (FAISS IVF+PQ)

Ranking Model (Stage 2)

Production Systems Combine All Four

Key Design Decisions & Tradeoffs

Two-Stage Funnel vs. Single Full-Catalogue Ranking

Two-Stage (Retrieve + Rank)

Single Full-Catalogue Ranker

Lambda (Batch + Stream) vs. Batch-Only

Lambda Architecture

Batch-Only

Two-Tower vs. Feature Cross Model for Candidate Gen

Two-Tower (Decoupled)

Feature Cross Model

Daily Batch Retraining vs. Online Learning

Daily Batch + Real-Time Features

Full Online Learning

Precomputed Cache: Full vs. Hybrid

Hybrid (Power Users + On-Demand)

Full Precompute (100 M Users)

What Can Go Wrong

Training-Serving Skew

Feedback Loop Collapse (Echo Chamber)

Position Bias in Training Data

ANN Index Staleness or Corruption

Event Pipeline Lag

Model Quality Regression

Feature Store Outage

Adversarial Manipulation (Shilling Attacks)

Interview Tips

Similar Problems

Search Ranking

Ad Ranking

News Feed (Twitter)

Notification System

YouTube / Netflix

Distributed Message Queue

Evolution

MVP — SQL Co-Occurrence (10 K users, 1 K items)

First ML — Matrix Factorisation (500 K users, 100 K items)

Production ML — Two-Tower + ANN (10 M users, 1 M items)

Full Platform (100 M+ users, 50 M items)

Planet Scale (500 M+ users, multi-region)

References & Videos

Search Ranking

Ad Ranking

HyperLogLog & Sketches