(a) Near-duplicate clustering. When 200 outlets publish "President signs climate bill," the user should see one card with "200+ sources." We use SimHash (or MinHash) on article text to produce a 64-bit fingerprint. Two articles with Hamming distance ≤ 3 are considered near-duplicates.
-- SimHash clustering pseudocode
fingerprint = simhash(article.text) -- 64-bit hash
candidates = lookup_lsh(fingerprint, k=3) -- LSH index: find hashes within Hamming distance 3
if candidates:
best_cluster = closest(candidates)
merge(article, best_cluster) -- add to existing story cluster
else:
create_cluster(article) -- new story
-- representative = highest-authority article in cluster
Each cluster stores: representative article (highest authority score), source list, earliest publish time, and topic labels. The feed shows the representative with "N sources" badge.
Why SimHash over MinHash? SimHash produces a single 64-bit fingerprint per document -- constant space regardless of document length. MinHash produces a signature of k hashes (typically k=128), more accurate for Jaccard similarity but 128x more storage. For news articles (similar length, mostly text), SimHash at distance 3 achieves ~95% recall with far less storage. We use MinHash as a secondary offline check for borderline cases (Hamming distance 3-5).
LSH index structure. We split the 64-bit SimHash into 4 bands of 16 bits. Two documents that share at least one identical 16-bit band are candidates. This gives sub-millisecond lookups against the 50M active cluster fingerprints, stored in memory across a sharded hash table.
(b) Freshness decay. A breaking-news article from 10 minutes ago should rank far above yesterday's story. We model this with exponential decay:
score = authority_score * e^(-lambda * age_hours)
-- Breaking news: lambda = 0.5 (half-life ~1.4 hours)
-- Regular news: lambda = 0.1 (half-life ~7 hours)
-- Evergreen: lambda = 0.02 (half-life ~35 hours)
The system classifies each article's decay rate based on topic and velocity (how fast new articles join the cluster). A cluster gaining 50 new articles/hour gets breaking-news lambda.
Adaptive lambda selection. A simple heuristic: if cluster.article_count_last_hour > 20, use breaking lambda (0.5). If the article's topic is in ["obituary", "historical", "explainer"], use evergreen lambda (0.02). Default: regular (0.1). The ML trainer can also learn per-topic lambda from engagement data -- topics where users prefer recency get higher lambda.
(c) Diversity via greedy MMR. Without diversity constraints, a user interested in politics would see 20 politics articles from CNN. We apply Maximal Marginal Relevance:
selected = []
for i in range(top_K):
best = argmax over candidates:
alpha * relevance(candidate, user)
- (1 - alpha) * max_similarity(candidate, selected)
selected.append(best)
-- Constraint: no more than 2 articles from same source in top-20
Each pick maximizes relevance to the user while minimizing similarity to already-selected articles. Alpha = 0.7 balances relevance vs diversity. Hard cap: max 2 articles per source domain in the top 20.
MMR in practice. The candidate pool is ~500 top-scoring clusters after the initial ranker pass. MMR re-ranks these into the final top-20. Similarity is computed as cosine distance between cluster topic-embedding vectors (pre-computed, 128-dim). The full MMR loop runs in < 5 ms for 500 candidates -- fast enough for on-request computation. This is the key to preventing "5 articles about the same politician" feeds.
Source diversity enforcement. Beyond MMR's soft diversity, we apply a hard constraint: after selecting 2 articles from source X, all remaining candidates from source X are removed from the pool. This guarantees no single outlet dominates the feed, even if the user clicks CNN articles exclusively.
Article Ingestion FlowMermaid
flowchart LR
A[Article Crawled] --> B[Content Pipeline]
B --> B1[Text Extract + Clean]
B1 --> B2[Entity + Topic Classify]
B2 --> C[SimHash Fingerprint]
C --> D{Cluster Match
Hamming dist lte 3?}
D -->|Yes| E[Merge into Existing Cluster]
D -->|No| F[Create New Cluster]
E --> G[Update Cluster Metadata]
F --> G
G --> H[Ranker: authority x freshness decay]
H --> I[Top-K per User with MMR Diversity]
I --> J[Feed Cache - Redis 5min TTL]
Interview answer
"Crawlers fetch 100K articles/day from 50K sources via RSS and web scraping. Each article is SimHash-fingerprinted and matched against an LSH index to find near-duplicate clusters. The ranker scores clusters using authority x freshness-decay, personalized by user interest vectors from click history. Feed selection uses greedy MMR to maximize relevance while enforcing diversity -- no more than 2 articles from the same source in top-20. Pre-computed feeds are cached in Redis with 5-minute TTL; breaking news triggers cache invalidation. Total feed latency: < 200 ms."