Exercise · Search & Discovery

Google News

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points45 min whiteboardReference solution →

Prompt

Crawl 50K+ news sources worldwide, rank by freshness + authority + personalization, and serve a unique feed to each of 1B users. The hard parts: near-duplicate clustering so users see one card per story with "N sources" instead of 50 identical headlines, freshness decay that surfaces breaking news in seconds but lets evergreen content linger, and a diversity constraint (greedy MMR) that prevents any single source from dominating the feed. Google News, Apple News, MSN News, Flipboard -- same pattern, different editorial stance. The system must balance three competing goals: recency (show what's happening now), authority (from trustworthy sources), and relevance (personalized to each user's interests).

Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.

Hints (progressive — click to reveal)

Hint 1

Lead with clustering. "The key insight is that 200 outlets publish the same story -- we cluster near-duplicates via SimHash and show one representative per cluster." This immediately shows you understand what makes news aggregation different from generic feed.

Hint 2

Name the freshness model. "score = authority x e^(-lambda x age) where lambda varies by article type." Concrete formula beats hand-waving about "freshness matters."

Hint 3

Explain MMR for diversity. "Greedy Maximal Marginal Relevance -- each pick maximizes relevance while minimizing similarity to already-selected items." Shows you know recommender-system theory.

Rubric — 10 points

+2 Lead with clustering. "The key insight is that 200 outlets publish the same story -- we cluster near-duplicates via SimHash and show one representative per cluster." This immediately shows you understand what makes news aggregation different from generic feed.
+2 Name the freshness model. "score = authority x e^(-lambda x age) where lambda varies by article type." Concrete formula beats hand-waving about "freshness matters."
+2 Explain MMR for diversity. "Greedy Maximal Marginal Relevance -- each pick maximizes relevance while minimizing similarity to already-selected items." Shows you know recommender-system theory.
+1 Adaptive crawl rate is the politeness story. Don't just say "we crawl." Say "CNN every 2 min, local blog every 6 hours, based on learned publish frequency." Shows operational maturity.
+1 Distinguish from social feed. News feed ranks professional content by authority + freshness. Social feed ranks user-generated content by engagement + social graph. Different ranking signals, different abuse vectors.
+1 Mention the cold-start problem. New user with no click history: fall back to location-based trending + globally popular stories. After ~20 clicks, the interest vector has enough signal for personalization. Explicit topic follows accelerate cold-start.
+1 Crawl-to-display latency is a differentiator. "Breaking news visible in < 5 min: priority re-crawl triggered by trending velocity spike, plus push-based ingestion from wire services (AP, Reuters)." Shows you think about the full pipeline end-to-end.

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.

Red flags (things that tank the interview)

Re-crawl all 50K sources every 5 minutes
Show 5 articles from the same source on the same topic
Rank purely by click-through rate
Use exact string matching for deduplication
Pre-compute feeds for all 1B users every 5 minutes