Google News

Crawl 50K+ news sources worldwide, rank by freshness + authority + personalization, and serve a unique feed to each of 1B users. The hard parts: near-duplicate clustering so users see one card per story with "N sources" instead of 50 identical headlines, freshness decay that surfaces breaking news in seconds but lets evergreen content linger, and a diversity constraint (greedy MMR) that prevents any single source from dominating the feed. Google News, Apple News, MSN News, Flipboard -- same pattern, different editorial stance. The system must balance three competing goals: recency (show what's happening now), authority (from trustworthy sources), and relevance (personalized to each user's interests).

Core: Crawl + Cluster + Rank + Personalize · 1B users · 100K articles/day · 200K feed QPS · 10B indexed articles
02

Requirements

Functional
  • Crawl 50K+ sources via RSS feeds + web scraping; ingest ~100K new articles/day
  • Extract entities, classify topic, detect language, and cluster near-duplicate articles covering the same story
  • Personalized feed per user based on interests, location, click history
  • Trending topics: detect stories gaining velocity across sources
  • Full-text search across the article corpus
  • Engagement signals: clicks, dwell time, shares feed back into ranking
  • "Full coverage" view: all articles in a story cluster, grouped by perspective
  • Topic follow/unfollow: users explicitly curate interest areas
Non-Functional
  • Feed latency < 200 ms p99 for cached users
  • Breaking news appears in feed within 5 min of publication
  • No more than 2 articles from the same source in top-20 results
  • Handle 200K feed reads/sec at peak hours
  • Crawler respects robots.txt and rate-limits per domain (politeness)
  • Suppress misinformation in ranking via authority scoring
  • Multi-language support: detect and serve content in user's preferred language
  • 99.9% availability -- news is time-sensitive; downtime = missed stories
03

Scale Estimation

Daily active users
~300M
of 1B registered; each opens feed ~3x/day
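Feed QPS (peak)
~200K
300M users x 3 opens / 86,400 sec ≈ 10K QPS on average; a 2x peak multiplier alone gives ~21K, so the 200K target budgets roughly 10x extra headroom for breaking-news spikes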
New articles/day
~100K
from 50K+ sources; after dedup ~30K unique stories
Indexed corpus
~10B
articles accumulated over 20+ years
Crawl bandwidth
~5 GB/day
100K articles x ~50 KB avg HTML each (article bodies only; feed polling and re-fetches add to this)
Cluster store
~50M
active story clusters in rolling 30-day window
User interest model
~1 KB/user
128-dim topic vector + location + language; 1B users = ~1 TB
Click events/sec
~50K
engagement signals flowing into Kafka for ML pipeline
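
The headline numbers above can be sanity-checked with a few lines of arithmetic. This is a rough sketch that uses only the inputs stated in this section; the variable names are illustrative.

```python
# Back-of-envelope check using the inputs stated in this section.
DAU = 300_000_000              # daily active users
OPENS_PER_DAY = 3              # feed opens per active user
SECONDS_PER_DAY = 86_400
PEAK_QPS_TARGET = 200_000      # peak design target quoted above

avg_qps = DAU * OPENS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_QPS_TARGET / avg_qps   # how far the target sits above average

REGISTERED_USERS = 1_000_000_000
BYTES_PER_USER_MODEL = 1_000   # ~1 KB per user
user_model_tb = REGISTERED_USERS * BYTES_PER_USER_MODEL / 1e12

print(f"average feed QPS: {avg_qps:,.0f}")
print(f"peak target / average: {peak_to_avg:.1f}x")
print(f"user interest models: {user_model_tb:.1f} TB")
```

The ratio line makes explicit how much burst headroom the peak target implies relative to the daily average.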
04

API Design

GET /api/feed?user_id={uid}&interests={list}&location={loc}

Personalized feed. Returns top-N ranked story clusters with representative article, source count, and topic label. Paginated via cursor. Cache-friendly: ETag per user+timestamp bucket.
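
One way to derive an "ETag per user+timestamp bucket" is sketched below. The function name `feed_etag`, the 5-minute bucket (chosen here to match the Redis TTL), and the SHA-256 truncation are all assumptions made for illustration, not details given by the design.

```python
import hashlib

def feed_etag(user_id: str, now_epoch_s: float, bucket_s: int = 300) -> str:
    """Return a quoted ETag that stays constant within one time bucket.

    bucket_s=300 is an assumed bucket width (5 minutes).
    """
    bucket = int(now_epoch_s // bucket_s)
    raw = f"{user_id}:{bucket}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()[:16]
    return f'"{digest}"'
```

A client that sends the value back in `If-None-Match` inside the same bucket can be answered with `304 Not Modified` without touching the ranker.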

GET /api/topics/{topic}/articles

All articles for a given topic (e.g., "technology", "sports"). Returns ranked list with cluster grouping. Supports sort=freshness|relevance.

GET /api/trending

Top trending stories right now. Computed from velocity of new articles + clicks in sliding 1-hour window. Returns [{cluster_id, headline, source_count, velocity}].
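
A minimal in-memory sketch of the sliding-window velocity count follows. It treats every event (new article or click) as weight 1 and assumes timestamps arrive roughly in order; `TrendingDetector` and its methods are hypothetical names, and a production version would run over the Kafka stream rather than a local dict.

```python
from collections import defaultdict, deque

class TrendingDetector:
    """Counts events per cluster inside a sliding time window."""

    def __init__(self, window_s: int = 3600):
        self.window_s = window_s
        self.events = defaultdict(deque)   # cluster_id -> event timestamps

    def record(self, cluster_id: str, ts: float) -> None:
        self.events[cluster_id].append(ts)

    def velocity(self, cluster_id: str, now: float) -> int:
        q = self.events[cluster_id]
        # Drop events that have fallen out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)

    def top(self, now: float, n: int = 10):
        scored = [(self.velocity(cid, now), cid) for cid in list(self.events)]
        scored = [(v, cid) for v, cid in scored if v > 0]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [{"cluster_id": cid, "velocity": v} for v, cid in scored[:n]]
```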

POST /api/article/click

Engagement signal. Body: {user_id, article_id, dwell_ms, action: click|share|dismiss}. Written to Kafka for async processing. Updates user interest model + article quality signal.

GET /api/search?q={term}&freshness={24h|7d|30d}

Full-text search over article corpus. Returns ranked results with snippet highlighting. Freshness filter defaults to 7d. Powered by inverted index (Elasticsearch). Results include cluster metadata so the UI can show "N sources" per story.

GET /api/clusters/{cluster_id}/sources

All articles in a story cluster. Returns [{source, headline, url, publish_time, authority_score}] sorted by authority. Used for the "full coverage" view where users see all perspectives on one story.

05

Architecture

Four pipelines isolated by concern:

  • Crawl tier: Distributed crawlers fetch articles from 50K+ sources on adaptive schedules. URL frontier prioritizes sources by publish frequency. Respects robots.txt; backs off on errors.
  • Content tier: Kafka-backed pipeline extracts clean text, detects language, classifies topic, extracts named entities, and computes SimHash fingerprint for cluster matching.
  • Ranking tier: Scores clusters by authority x freshness decay, personalizes by user interest vector (128-dim topic embedding from click history), and applies MMR diversity constraint.
  • Serving tier: Pre-computed feeds cached in Redis with 5-min TTL. On-request lightweight re-rank injects breaking news. CDN caches trending and topic pages.

Offline ML trainer closes the loop: click/dwell data feeds back into the ranking model weights and user interest vectors via nightly batch jobs.

[Figure: Google News architecture. Components: RSS/web crawlers (50K+ sources), URL frontier (adaptive schedule), DNS resolver (cached, polite), robots.txt cache (respects crawl-delay), ingest queue (Kafka, partitioned), content pipeline (extract, classify, entity, dedup), article store (Bigtable), cluster store (near-duplicate groups), ranker service (score + personalize), feed cache (Redis, per user), Elasticsearch (full-text search), click stream (Kafka), ML trainer (offline, on click data), user interest model (topic affinities, location, reading history), trending detector (velocity window), users (1B, personalized feed).]

Request flow: RSS/web crawlers (50K+ sources) -> ingest queue (Kafka, partitioned) -> content pipeline (extract + classify) -> SimHash cluster (near-duplicate grouping) -> ranker service (authority x freshness) -> feed cache (Redis, per user) -> user feed (personalized top-K)
06

Deep Dive — Clustering, Freshness & Diversity

(a) Near-duplicate clustering. When 200 outlets publish "President signs climate bill," the user should see one card with "200+ sources." We use SimHash (or MinHash) on article text to produce a 64-bit fingerprint. Two articles with Hamming distance ≤ 3 are considered near-duplicates.

-- SimHash clustering pseudocode
fingerprint = simhash(article.text)          -- 64-bit hash
candidates  = lookup_lsh(fingerprint, k=3)   -- LSH index: find hashes within Hamming distance 3
if candidates:
    best_cluster = closest(candidates)
    merge(article, best_cluster)              -- add to existing story cluster
else:
    create_cluster(article)                   -- new story
-- representative = highest-authority article in cluster

Each cluster stores: representative article (highest authority score), source list, earliest publish time, and topic labels. The feed shows the representative with "N sources" badge.
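
To make the fingerprint step concrete, here is a minimal SimHash sketch. It assumes unweighted word tokens and uses MD5 only as a convenient stable 64-bit token hash; a real pipeline would more likely use stemmed shingles with TF-IDF weights. The function names are illustrative.

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide SimHash of the text (unweighted word tokens)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # stable 64-bit token hash
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:                                   # majority vote per bit
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two articles would then be treated as near-duplicates when `hamming(simhash(a), simhash(b)) <= 3`.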

Why SimHash over MinHash? SimHash produces a single 64-bit fingerprint per document -- constant space regardless of document length. MinHash produces a signature of k hashes (typically k=128), more accurate for Jaccard similarity but roughly 128x more storage when each hash is 64 bits. For news articles (similar length, mostly text), SimHash at distance 3 achieves ~95% recall with far less storage. We use MinHash as a secondary offline check for borderline cases (Hamming distance 4-5).

LSH index structure. We split the 64-bit SimHash into 4 bands of 16 bits. Two documents that share at least one identical 16-bit band are candidates. This gives sub-millisecond lookups against the 50M active cluster fingerprints, stored in memory across a sharded hash table.
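
The banding scheme works because of a pigeonhole argument: 3 differing bits can touch at most 3 of the 4 bands, so any pair within Hamming distance 3 shares at least one identical 16-bit band. The sketch below is a single-process illustration of that index; `BandIndex` is a hypothetical name and sharding is left out.

```python
from collections import defaultdict

BANDS, BAND_BITS = 4, 16
MASK = (1 << BAND_BITS) - 1

def bands(fp: int):
    """Split a 64-bit fingerprint into (band_number, 16-bit value) keys."""
    return [(i, (fp >> (i * BAND_BITS)) & MASK) for i in range(BANDS)]

class BandIndex:
    def __init__(self):
        self.buckets = defaultdict(set)   # (band_no, value) -> cluster ids
        self.fingerprints = {}            # cluster id -> fingerprint

    def add(self, cluster_id: str, fp: int) -> None:
        self.fingerprints[cluster_id] = fp
        for key in bands(fp):
            self.buckets[key].add(cluster_id)

    def lookup(self, fp: int, max_dist: int = 3):
        """Return [(distance, cluster_id)] within max_dist, closest first."""
        candidates = set()
        for key in bands(fp):
            candidates |= self.buckets.get(key, set())
        scored = [(bin(fp ^ self.fingerprints[c]).count("1"), c) for c in candidates]
        return sorted((d, c) for d, c in scored if d <= max_dist)
```

Candidates that share a band but differ in more than 3 bits are filtered out by the exact Hamming check at the end of `lookup`.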

(b) Freshness decay. A breaking-news article from 10 minutes ago should rank far above yesterday's story. We model this with exponential decay:

score = authority_score * e^(-lambda * age_hours)

-- Breaking news:  lambda = 0.5  (half-life ~1.4 hours)
-- Regular news:   lambda = 0.1  (half-life ~7 hours)
-- Evergreen:      lambda = 0.02 (half-life ~35 hours)

The system classifies each article's decay rate based on topic and velocity (how fast new articles join the cluster). A cluster gaining 50 new articles/hour gets breaking-news lambda.

Adaptive lambda selection. A simple heuristic: if cluster.article_count_last_hour > 20, use breaking lambda (0.5). If the article's topic is in ["obituary", "historical", "explainer"], use evergreen lambda (0.02). Default: regular (0.1). The ML trainer can also learn per-topic lambda from engagement data -- topics where users prefer recency get higher lambda.
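
The decay formula and the lambda heuristic translate directly into code. This sketch uses the thresholds and topic list quoted above; the function names are illustrative, and the velocity check runs first, mirroring the order in which the heuristic is stated.

```python
import math

BREAKING, REGULAR, EVERGREEN = 0.5, 0.1, 0.02
EVERGREEN_TOPICS = {"obituary", "historical", "explainer"}

def pick_lambda(topic: str, articles_last_hour: int) -> float:
    """Choose the decay rate from cluster velocity, then topic."""
    if articles_last_hour > 20:
        return BREAKING
    if topic in EVERGREEN_TOPICS:
        return EVERGREEN
    return REGULAR

def freshness_score(authority: float, age_hours: float, lam: float) -> float:
    """score = authority * e^(-lambda * age_hours)"""
    return authority * math.exp(-lam * age_hours)

def half_life_hours(lam: float) -> float:
    """Age at which the score has dropped to half: ln(2) / lambda."""
    return math.log(2) / lam
```

For instance, `half_life_hours(0.5)` is about 1.4 hours, which matches the breaking-news figure listed above.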

(c) Diversity via greedy MMR. Without diversity constraints, a user interested in politics would see 20 politics articles from CNN. We apply Maximal Marginal Relevance:

selected = []
for i in range(top_K):
    best = argmax over candidates:
        alpha * relevance(candidate, user)
        - (1 - alpha) * max_similarity(candidate, selected)
    selected.append(best)
-- Constraint: no more than 2 articles from same source in top-20

Each pick maximizes relevance to the user while minimizing similarity to already-selected articles. Alpha = 0.7 balances relevance vs diversity. Hard cap: max 2 articles per source domain in the top 20.

MMR in practice. The candidate pool is ~500 top-scoring clusters after the initial ranker pass. MMR re-ranks these into the final top-20. Similarity is computed as cosine similarity between cluster topic-embedding vectors (pre-computed, 128-dim). The full MMR loop runs in < 5 ms for 500 candidates -- fast enough for on-request computation. This is the key to preventing "5 articles about the same politician" feeds.

Source diversity enforcement. Beyond MMR's soft diversity, we apply a hard constraint: after selecting 2 articles from source X, all remaining candidates from source X are removed from the pool. This guarantees no single outlet dominates the feed, even if the user clicks CNN articles exclusively.
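
Putting the MMR loop and the per-source cap together, a runnable sketch might look like this. The candidate dict layout (`id`, `source`, `relevance`, `vec`) is an assumption for illustration; relevance is taken as already computed by the ranker.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(candidates, top_k=20, alpha=0.7, max_per_source=2):
    """Greedy MMR over dicts with 'id', 'source', 'relevance', 'vec'."""
    pool = list(candidates)
    selected, per_source = [], {}
    while pool and len(selected) < top_k:
        def mmr(c):
            # Penalize similarity to the closest already-selected item.
            sim = max((cosine(c["vec"], s["vec"]) for s in selected), default=0.0)
            return alpha * c["relevance"] - (1 - alpha) * sim
        best = max(pool, key=mmr)
        pool.remove(best)
        selected.append(best)
        src = best["source"]
        per_source[src] = per_source.get(src, 0) + 1
        if per_source[src] >= max_per_source:
            # Hard cap reached: drop this source's remaining candidates.
            pool = [c for c in pool if c["source"] != src]
    return selected
```

With three candidates from one source and one from another, the third same-source candidate is dropped from the pool as soon as the cap is hit, regardless of its relevance.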

Article Ingestion Flow (Mermaid)
flowchart LR
    A[Article Crawled] --> B[Content Pipeline]
    B --> B1[Text Extract + Clean]
    B1 --> B2[Entity + Topic Classify]
    B2 --> C[SimHash Fingerprint]
    C --> D{"Cluster Match: Hamming dist lte 3?"}
    D -->|Yes| E[Merge into Existing Cluster]
    D -->|No| F[Create New Cluster]
    E --> G[Update Cluster Metadata]
    F --> G
    G --> H[Ranker: authority x freshness decay]
    H --> I[Top-K per User with MMR Diversity]
    I --> J[Feed Cache - Redis 5min TTL]

Interview answer

"Crawlers fetch 100K articles/day from 50K sources via RSS and web scraping. Each article is SimHash-fingerprinted and matched against an LSH index to find near-duplicate clusters. The ranker scores clusters using authority x freshness-decay, personalized by user interest vectors from click history. Feed selection uses greedy MMR to maximize relevance while enforcing diversity -- no more than 2 articles from the same source in top-20. Pre-computed feeds are cached in Redis with 5-minute TTL; breaking news triggers cache invalidation. Total feed latency: < 200 ms."

07

Anti-patterns

🚫
Re-crawl all 50K sources every 5 minutes

Most sources publish 2-3 articles/day. Crawling them every 5 min wastes bandwidth and angers site admins (rate-limit bans).

Better: Adaptive crawl rate based on source publish frequency. CNN: every 2 min. Local blog: every 6 hours. Learn the cadence.
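
A simple way to turn a learned publish frequency into a crawl interval is sketched below. The "poll about twice per expected publish gap" rule is an assumption chosen for illustration; the 2-minute and 6-hour bounds come from the examples above.

```python
def next_crawl_interval_s(articles_per_day: float,
                          min_s: int = 120,
                          max_s: int = 6 * 3600) -> int:
    """Seconds until the next crawl, clamped to [min_s, max_s].

    Assumed policy: poll roughly twice per expected gap between articles.
    """
    if articles_per_day <= 0:
        return max_s                      # nothing observed yet: slowest cadence
    expected_gap_s = 86_400 / articles_per_day
    return int(min(max_s, max(min_s, expected_gap_s / 2)))
```

A source publishing hundreds of articles a day lands on the 2-minute floor, while a blog publishing twice a day lands on the 6-hour ceiling.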
🚫
Show 5 articles from the same source on the same topic

Terrible UX. User sees 5 CNN headlines about the same story. No information gain. Feed feels like a single-source reader.

Better: Cluster near-duplicates; show one representative per cluster with "N sources" badge. Apply MMR for diversity, plus a hard cap of 2 articles per source in the top 20.
🚫
Rank purely by click-through rate

Clickbait wins. "You won't believe what happened next" outranks authoritative journalism. Users lose trust; quality sources leave the platform.

Better: Blend authority score (source reputation, PageRank-like) + freshness decay + engagement. Authority acts as a cap: engagement cannot lift clickbait above what its source's authority allows.
🚫
Use exact string matching for deduplication

Two articles about the same event use different wording. "President signs bill" vs "Bill signed into law by President" are 0% string match but 100% same story.

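Better: SimHash on tokenized, stemmed text with stop-word removal. Captures near-identical wording instead of requiring exact string identity. A Hamming distance threshold of 3 catches lightly reworded duplicates; heavily paraphrased coverage of the same event may need an additional signal such as shared named entities.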
🚫
Pre-compute feeds for all 1B users every 5 minutes

1B x 5-min cycle = 3.3M feed computations/sec. Each feed computation touches ranker + user model + cluster store. Impossible compute budget.

Better: Pre-compute only for active users (DAU ~300M). Compute on-demand for inactive users on first request. Cache with TTL; lazy refresh.
08

Tradeoffs & Design Choices

  • Pre-compute personalized feed vs compute on request. Pre-compute: fast serving (~50 ms from Redis), but stale by up to 5 min. On-request: always fresh, but 200-500 ms latency and higher compute cost. Hybrid: pre-compute base feed, apply lightweight re-rank on request for breaking news injection.
  • SimHash (fast, approximate) vs exact dedup (slow, precise). SimHash at Hamming distance 3 catches ~95% of near-duplicates with < 1 ms per lookup. Exact comparison (TF-IDF cosine) catches 99% but costs ~50 ms. SimHash for real-time pipeline; exact dedup as offline cleanup job.
  • Aggressive personalization vs diverse feed. Deep personalization creates filter bubbles -- user only sees topics they already like. Diverse feed includes serendipitous discovery but may feel less relevant. Tunable alpha in MMR: 0.9 = heavy personalization, 0.5 = balanced exploration.
  • Crawl depth vs latency. Deep-crawling (following links within articles) yields richer content and related articles but adds minutes to ingestion latency. Shallow crawl (RSS + landing page only) is faster but misses context.
  • Real-time trending vs batch trending. Real-time (sliding window on Kafka stream) detects trends in minutes but is compute-heavy. Batch (hourly MapReduce) is cheaper but misses fast-moving stories. Use real-time for top-of-feed "breaking" slot; batch for topic pages.
  • Source-level authority vs article-level quality. Source authority (domain reputation) is stable and cheap to compute but penalizes good articles from low-authority sources. Article-level quality (readability, factual density, expert quotes) is more accurate but requires NLP inference per article. Blend: 70% source authority + 30% article quality signal.
  • Single global ranker vs per-region rankers. Global ranker simplifies operations but cannot capture regional editorial norms (e.g., tabloid-style is normal in UK, not in Japan). Per-region rankers allow tuning but multiply model-training cost. Compromise: global base model + region-specific feature weights.
09

Failure Modes

📰
Propaganda / misinformation ranks high
State-sponsored outlets publish coordinated articles, gaming cluster size and freshness. Authority score helps but is imperfect for new domains.
Mitigation: authority score based on domain age, journalist bylines, fact-check cross-references. New domains start with low authority; manual review for fast-rising unknown sources.
🕸
Crawler overwhelms small news sites
Aggressive crawling of a small-town newspaper with a single web server causes their site to go down.
Mitigation: respect robots.txt crawl-delay; adaptive rate-limit per domain (max 1 req/10s for small sites); back off on 429/503 responses.
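
The per-domain limit and back-off can be sketched as a small state machine. The doubling policy and the one-hour ceiling are assumptions for illustration; the caller supplies the clock (for example `time.monotonic()`), which keeps the class easy to test.

```python
class DomainThrottle:
    """Minimum spacing between requests to one domain, with back-off."""

    def __init__(self, min_interval_s: float = 10.0, max_interval_s: float = 3600.0):
        self.base = min_interval_s
        self.max = max_interval_s
        self.interval = min_interval_s
        self.next_allowed = 0.0

    def can_fetch(self, now: float) -> bool:
        return now >= self.next_allowed

    def record(self, now: float, status: int) -> None:
        if status in (429, 503):
            # Server is struggling or rate-limiting: double the spacing.
            self.interval = min(self.max, self.interval * 2)
        else:
            self.interval = self.base
        self.next_allowed = now + self.interval
```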
Breaking-news latency (crawl-to-display lag)
Major event happens but feed doesn't update for 15 minutes because crawl cycle hasn't hit that source yet.
Mitigation: priority re-crawl triggers -- when trending detection sees velocity spike on a topic, immediately re-crawl top sources for that topic. Push-based ingestion from major wire services (AP, Reuters).
📦
Stale clusters: old articles grouped with today's story
A cluster about "election results" from 2 days ago keeps absorbing today's new election articles because SimHash matches.
Mitigation: cluster TTL -- clusters older than 48 hours are frozen (no new merges). New articles on the same topic create a fresh cluster. Time-gated SimHash: only match against clusters from the last 48 hours.
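
The time gate can be expressed as a filter applied before the Hamming check. In this sketch, cluster age is measured from the cluster's earliest publish time, which is an assumption; `pick_cluster` and the tuple layout are illustrative.

```python
MATCH_WINDOW_S = 48 * 3600   # clusters older than this are frozen

def pick_cluster(article_fp, article_ts, clusters, max_dist=3):
    """clusters: iterable of (cluster_id, fingerprint, first_seen_ts).

    Returns the closest non-frozen cluster id, or None to start a new cluster.
    """
    best = None
    for cluster_id, fp, first_seen in clusters:
        if article_ts - first_seen > MATCH_WINDOW_S:
            continue                       # frozen: no new merges
        dist = bin(article_fp ^ fp).count("1")
        if dist <= max_dist and (best is None or dist < best[0]):
            best = (dist, cluster_id)
    return best[1] if best else None
```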
🔍
Feed cache stampede on cache miss
Popular user's cached feed expires; 100 concurrent requests all trigger expensive re-ranking simultaneously.
Mitigation: cache lock (single-flight pattern) -- first request computes, others wait. Stale-while-revalidate: serve slightly stale feed while recomputing in background.
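
The single-flight half of this mitigation can be sketched with the standard library. This is an in-process illustration only; across multiple ranker instances the same idea would need a distributed lock, and `SingleFlight` is a hypothetical helper name.

```python
import threading

class SingleFlight:
    """Collapse concurrent calls for the same key into one computation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> shared entry for the running call

    def do(self, key, compute):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["value"] = compute()
            except Exception as exc:      # propagate the failure to waiters too
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()
        else:
            entry["done"].wait()          # followers block until the leader finishes
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

Results are shared only among callers that overlap in time; a call arriving after the leader finishes starts a fresh computation, so this complements the cache rather than replacing it.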
🌐
Language detection errors cascade into wrong-topic clusters
Article in Portuguese about "futebol" is misclassified as English, placed in wrong topic cluster. User sees foreign-language articles in their feed.
Mitigation: two-stage language detection (fastText + character n-gram fallback). Language mismatch between article and cluster triggers re-classification. User-level language preference as hard filter before feed assembly.
Authority score manipulation by link farms
Bad actors create networks of sites that link to each other, inflating PageRank-style authority scores for propaganda sites.
Mitigation: discount links from domains with low organic traffic. Cross-reference authority with fact-check databases. Manual review queue for domains with rapid authority score increases.
10

Interview Tips

  1. Lead with clustering. "The key insight is that 200 outlets publish the same story -- we cluster near-duplicates via SimHash and show one representative per cluster." This immediately shows you understand what makes news aggregation different from a generic feed.
  2. Name the freshness model. "score = authority x e^(-lambda x age) where lambda varies by article type." Concrete formula beats hand-waving about "freshness matters."
  3. Explain MMR for diversity. "Greedy Maximal Marginal Relevance -- each pick maximizes relevance while minimizing similarity to already-selected items." Shows you know recommender-system theory.
  4. Adaptive crawl rate is the politeness story. Don't just say "we crawl." Say "CNN every 2 min, local blog every 6 hours, based on learned publish frequency." Shows operational maturity.
  5. Distinguish from social feed. News feed ranks professional content by authority + freshness. Social feed ranks user-generated content by engagement + social graph. Different ranking signals, different abuse vectors.
  6. Mention the cold-start problem. New user with no click history: fall back to location-based trending + globally popular stories. After ~20 clicks, the interest vector has enough signal for personalization. Explicit topic follows accelerate cold-start.
  7. Crawl-to-display latency is a differentiator. "Breaking news visible in < 5 min: priority re-crawl triggered by trending velocity spike, plus push-based ingestion from wire services (AP, Reuters)." Shows you think about the full pipeline end-to-end.
12

Evolution

1

RSS aggregator + chronological

Simple RSS reader. Subscribe to feeds, display articles newest-first. No ranking, no clustering. Works for 10 sources; breaks at 1K. This was Google Reader (2005-2013) and early Feedly.

2

Web crawl + keyword ranking

Crawl beyond RSS using Googlebot-News. TF-IDF keyword matching for topic classification. Basic relevance scoring by keyword density and recency. Still no deduplication -- same story appears 50 times from different outlets.

3

Clustering + authority scoring

SimHash for near-duplicate detection. PageRank-style authority score for sources (link analysis across the news web). One card per story cluster with "N sources" count. Source diversity enforced. This is where the product becomes genuinely useful -- the signal-to-noise ratio improves dramatically.

4

Personalized feed + engagement ML

User interest vectors derived from click history and explicit topic follows. Learning-to-rank model (LambdaMART or neural) trained on engagement signals. MMR diversity constraint prevents filter bubbles. Pre-computed feeds cached in Redis with 5-min TTL; breaking-news triggers cache bust.

5

LLM-generated summaries + multi-perspective view

LLM summarizes each cluster into a neutral paragraph, citing key sources. "View from left / center / right / international" perspective tabs per story. Fact-check annotations from trusted sources (Snopes, PolitiFact). AI-generated daily topic briefings personalized to user interests.
