Naive top-K: keep a dict term→count; at refresh time, sort and take top 10. Breaks immediately — at 90K term-events/sec, the dict is 50M+ unique keys/day; sorting 50M entries per geo per minute is untenable.
Count-Min Sketch + heavy-hitter tracker. A Count-Min Sketch is a probabilistic counter: constant memory, approximate counts with small over-estimation bias. Paired with a bounded min-heap of top-K heavy hitters, it gives us top-K in fixed memory.
- On each term event: increment CMS for (term, geo, window) → get approximate count. If count > current min-heap threshold, add/update in heap.
- Memory footprint: a CMS with 4 hash functions × 1M counters ≈ 16 MB per (geo, window). Acceptable for hundreds of geos.
- Window management: sliding 5-minute window. Old sketches age out as new ones accumulate.
"Trending" means deviation from baseline, not raw volume. #love is frequent every day — it's not news. A new spike of a previously-rare term IS news. Rank by:
trend_score(term, geo) = current_rate(term, geo) / ema_7day_rate(term, geo)
This EMA (exponentially-weighted moving average) is the baseline — cached per (term, geo) for frequent terms, defaulted for never-before-seen ones. The ratio metric surfaces true spikes, not the ambient background.
Trending Score Computation
Mermaid
flowchart LR
T[Tweet] --> X[Extractor
tokens+entities]
X --> S[Spam filter
bot score]
S -->|pass| C[CMS counter
5-min window]
C --> H[Heavy-hitter
min-heap]
H --> R[Ranker
current/baseline]
B[Baseline EMA
past 7 days] --> R
R --> D[Dedup & filter
policy block]
D --> P[Preview picker
top tweet / LLM]
P --> K[Trend cache
Redis per geo]
Spam / bot filtering. Bots can trivially game raw counts. Each user has a user reputation score (follower ratio, age, engagement authenticity signals); a tweet's contribution is weighted. Additionally, term events are deduped by near-hash (prevent copy-paste spam) and clustered by author — if 100 tweets come from 10 accounts with follow-overlap = 1.0, that's a bot ring; down-weight drastically.
Dedup + entity resolution. "#BarackObama", "#Obama", "Barack Obama" should merge. Entity resolution via a lightweight NLP pipeline (knowledge-graph lookup + fuzzy matching) clusters surface forms into canonical trends. Otherwise the top-10 is noisy and overlapping.
Preview pick. Once ranker decides a term is trending, it runs a quick search for recent high-engagement tweets matching that term; picks one as the preview. Modern systems (2024+) use an LLM to write a 1-line summary ("Markets tumbled after the Fed's rate announcement") instead of just picking a tweet.
Interview answer
"Firehose of tweets → Kafka. A feature extractor emits per-term events into a second topic. Each event hits a Count-Min Sketch per (term, geo, 5-min window) with a min-heap tracking top-100 heavy hitters. Every 5 min, a ranker computes trending score = current_rate / 7-day EMA baseline, filters spam via weighted user reputation, dedupes entity surface forms, picks a preview tweet (or LLM summary), and writes top-10 per geo to a Redis cache. Clients read from cache only."