Twitter Trending

Surface the top-10 trending topics — hashtags, phrases, named entities — from the firehose of all tweets, refreshed every few minutes, per geo and per user. The hard parts: approximate top-K over a high-velocity stream where storing exact counts for every term is impossible; distinguishing a trend from steady background noise (#love is always common, so it's not trending); and filtering spam + bot amplification before a coordinated campaign games the list. X/Twitter does ~500M tweets/day, ~6K/sec; trending updates for 400+ geos every ~5 minutes.

⚡ Core: Stream Top-K + Novelty + Anti-Spam~6K tweets/sec500M tweets/day400+ geo scopes5-min refresh
02

Requirements

Functional
  • Identify trending terms — hashtags, bigrams, named entities — from all tweets
  • Ranking by "trending-ness": how much more frequent NOW than baseline
  • Per-geo trends (country, city), per-language
  • Personalized trends based on user's follow graph + interests
  • Top-10 list refreshed every ~5 min per scope
  • Attach a tweet / summary preview to each trend (what's the story)
Non-Functional
  • End-to-end latency from tweet → trends recompute: < 1 min
  • Scale to ~6K tweets/sec sustained; peaks 20K+ for major events
  • 400+ geo scopes × ~10 languages × personalization → millions of trend lists
  • Resilient to coordinated manipulation (bot rings, hashtag hijacking)
  • Memory-efficient: can't keep exact counts for every term (billions of unique terms)
  • Eventually consistent; approximations are fine if close to truth
03

Scale Estimation

Tweets / sec (avg)
~6K
500M/day ÷ 86,400; peaks 20K during live events
Tokens / tweet
~15
words + hashtags + bigrams after extraction; ~90K token events/sec
Unique terms / day
~50M
most appear <10 times; exact-count table would be ~50 GB/day minimum
Top-K size
K = 10
users see top-10 per list; compute top-100 internally for ranking + dedup
Refresh interval
5 min
trends don't need to flicker every second; 5-min is good UX
Geo scopes
~450
countries + major cities; each gets its own trends computation
04

API Design

GET/trends?woeid=23424977

Get trends for a Where-On-Earth ID (geo code). Returns {as_of, trends: [{name, volume, preview_tweet_id, url}], locations: [{country, ...}]}. Served from pre-computed cache; < 50 ms response.

GET/trends/personalized

Returns trends ranked by what's relevant to the caller's graph + interests. Requires auth; combines global trending with follow-graph signal.

GET/search?q=TREND&type=top

When user taps a trend: search for that term. Top tweets attached to the trend list come from this search run at trend-refresh time.

POST/trends/admin/block-term internal

Manually block a term from trending (policy violations, hate speech, spam). Small escape hatch layered on top of ML-based filtering.

GET/trends/available

List all WOEID locations that have trend data. Used by clients to populate the geo picker.

05

Architecture

A streaming pipeline. Tweets flow through Kafka; a feature extractor emits term events; a sketch service maintains approximate counts per (term, geo, window); a trend ranker runs every few minutes to compute top-K per geo, filter spam, pick a preview tweet, and write to the serving cache. Clients read from the cache only — no expensive queries on the hot path.

Streaming Pipeline — Firehose to Trends SVG
Tweet svc firehose source Kafka: tweets ~6K/sec Feature extractor tokens + entities Kafka: terms ~90K/sec Spam filter ML scorer User reps bot scores Sketch svc CountMin + TopK per (term, geo, window) Redis / in-mem sketches 5-min windows Trend ranker score + dedup + LLM runs every 5 min Historical baseline EMA of past 7 days Cassandra term history Preview picker pick tweet / LLM desc Trend cache Redis: top-K per geo Personalization svc rerank per user Client GET /trends API gateway geo detect Trends svc cache reader
Request Flow — Step Through
Firehose · all tweetsExtractor · tokens + entitiesSpam filter · bot-score weightCMS + top-K · per (term,geo,win)EMA baseline · past 7dRanker · current/baselineCache · top-10 per geo
Click Next Step to walk through the request flow.
06

Deep Dive — Top-K on a Firehose + "Trending" as Deviation

Naive top-K: keep a dict term→count; at refresh time, sort and take top 10. Breaks immediately — at 90K term-events/sec, the dict is 50M+ unique keys/day; sorting 50M entries per geo per minute is untenable.

Count-Min Sketch + heavy-hitter tracker. A Count-Min Sketch is a probabilistic counter: constant memory, approximate counts with small over-estimation bias. Paired with a bounded min-heap of top-K heavy hitters, it gives us top-K in fixed memory.

  • On each term event: increment CMS for (term, geo, window) → get approximate count. If count > current min-heap threshold, add/update in heap.
  • Memory footprint: a CMS with 4 hash functions × 1M counters ≈ 16 MB per (geo, window). Acceptable for hundreds of geos.
  • Window management: sliding 5-minute window. Old sketches age out as new ones accumulate.

"Trending" means deviation from baseline, not raw volume. #love is frequent every day — it's not news. A new spike of a previously-rare term IS news. Rank by:

trend_score(term, geo) = current_rate(term, geo) / ema_7day_rate(term, geo)

This EMA (exponentially-weighted moving average) is the baseline — cached per (term, geo) for frequent terms, defaulted for never-before-seen ones. The ratio metric surfaces true spikes, not the ambient background.

Trending Score Computation Mermaid
flowchart LR T[Tweet] --> X[Extractor
tokens+entities] X --> S[Spam filter
bot score] S -->|pass| C[CMS counter
5-min window] C --> H[Heavy-hitter
min-heap] H --> R[Ranker
current/baseline] B[Baseline EMA
past 7 days] --> R R --> D[Dedup & filter
policy block] D --> P[Preview picker
top tweet / LLM] P --> K[Trend cache
Redis per geo]

Spam / bot filtering. Bots can trivially game raw counts. Each user has a user reputation score (follower ratio, age, engagement authenticity signals); a tweet's contribution is weighted. Additionally, term events are deduped by near-hash (prevent copy-paste spam) and clustered by author — if 100 tweets come from 10 accounts with follow-overlap = 1.0, that's a bot ring; down-weight drastically.

Dedup + entity resolution. "#BarackObama", "#Obama", "Barack Obama" should merge. Entity resolution via a lightweight NLP pipeline (knowledge-graph lookup + fuzzy matching) clusters surface forms into canonical trends. Otherwise the top-10 is noisy and overlapping.

Preview pick. Once ranker decides a term is trending, it runs a quick search for recent high-engagement tweets matching that term; picks one as the preview. Modern systems (2024+) use an LLM to write a 1-line summary ("Markets tumbled after the Fed's rate announcement") instead of just picking a tweet.

Interview answer

"Firehose of tweets → Kafka. A feature extractor emits per-term events into a second topic. Each event hits a Count-Min Sketch per (term, geo, 5-min window) with a min-heap tracking top-100 heavy hitters. Every 5 min, a ranker computes trending score = current_rate / 7-day EMA baseline, filters spam via weighted user reputation, dedupes entity surface forms, picks a preview tweet (or LLM summary), and writes top-10 per geo to a Redis cache. Clients read from cache only."

07

Tradeoffs & Design Choices

  • Approximate (CMS) vs exact counts. Exact counts require 10–100× more memory for no meaningful accuracy gain in top-K. CMS's 1–5% over-estimation is invisible at top-K granularity.
  • Raw frequency vs deviation ranking. Raw frequency = "#love always wins." Deviation (current / baseline) is what makes trends feel like news. But deviation is noisy for low-count terms; floor the rate to avoid "brand new term with 10 tweets beats viral trend with 100K" false positives.
  • 5-min windows vs 1-min or 1-hour. 5 min is the UX sweet spot: responsive but stable. 1 min causes flicker; 1 hour misses breaking news. Keep multiple window sizes if needed for different products.
  • Geo detection. IP-geo for anonymous users; profile location for logged-in. Both are unreliable. Mitigation: use multiple signals + cluster by language; don't show "#MelbourneCupDay" globally.
  • Centralized ranker vs distributed. Ranking every geo from one process is simpler; distributing by geo shards is necessary at planet scale. Each geo computes its own top-K; a thin aggregator handles global/cross-geo "world" trends.
  • Human-in-the-loop. ML alone isn't enough. Teams maintain deny-lists for slurs, false positives, obvious manipulation. Be explicit about this in interviews — pretending ML solves content safety is naive.
08

Failure Modes

🤖
Coordinated hashtag campaign
Bot ring pushes #FakeOutrage to trending with 10K tweets from 50 bot accounts over 10 min.
→ Mitigation: author-diversity metric — require trend to have >N unique non-correlated authors. Low-reputation accounts weighted 10× less. Suspected coordinated accounts clustered by graph embedding.
📉
Trend-ranker pipeline falls behind
Kafka consumer lag grows during a major event (World Cup goal). Trends show 30 min stale.
→ Mitigation: horizontally scale consumer workers; partition by geo so no single partition bottlenecks; alerting on pipeline lag triggers auto-scale.
🔄
Zombie trends from EMA staleness
A term spikes, baseline EMA catches up, trend_score → 1, but term is still in top-K because current rate is still high.
→ Mitigation: momentum decay — subtract penalty for trends already shown >N minutes. Short-circuit trends with no growth (d/dt current_rate < 0) after some time.
🧹
Slur or policy-violation term trends
A horrific term trends due to news coverage or targeted campaign. Platform liability + safety problem.
→ Mitigation: admin deny-list; ML classifier for hate/violence/etc; trend results always pass through a safety filter before cache write; ops team on-call to manually block edge cases.
🌍
Geo mis-attribution
Most tweets lack precise geo. IP-based detection is wrong for VPN users. Mumbai-specific trend shows up in Bangalore.
→ Mitigation: combine IP + user-profile location + language heuristics; default to country-level when city confidence is low; show "Country trends" when precise geo is unknown.
📸
Preview tweet becomes inappropriate
Chosen preview tweet later violates policy (edited, deleted, reported). Users see a now-stale or offensive snippet.
→ Mitigation: preview has TTL; regenerated at next refresh; deletion events invalidate cache immediately. Fallback to LLM-generated neutral summary when no high-quality tweet found.
09

Interview Tips

  1. Lead with streaming top-K + CMS. Candidates who say "store counts in a DB and sort" fail here. The probabilistic data structure answer is expected.
  2. Clarify "trending" early. Deviation from baseline, not raw volume. This single sentence separates good from great answers.
  3. Anti-spam / anti-bot is load-bearing. Often skipped. Mention user-reputation weighting + author-diversity constraint.
  4. Geo scoping + personalization are separate concerns. Geo is pre-compute; personalization is runtime-rerank on top of geo's top-K. Don't mush them together.
  5. LLM-generated trend summaries is modern. Shows you know the 2024+ state of the art; a generic trend-picking design sounds dated.
11

Evolution

1

MVP — exact counts in Redis

Hashmap of term → count with per-minute buckets. Sort + take top 10. Works to ~10M tweets/day.

2

Count-Min Sketch + top-K heap

Switch to probabilistic counts. Handle ~100M+ unique terms/day in fixed memory. Per-geo sketches.

3

Deviation-based ranking + EMA baseline

"Trending" = current / baseline. No more "#love tops every day." 7-day EMA cached per (term, geo).

4

Spam filtering + entity resolution

User-reputation-weighted counts. Bot-ring detection via graph clustering. Entity merging for surface-form variants.

5

Personalized + LLM summaries

Trends reranked per user based on follow graph and interests. LLM generates 1-line description per trend ("Markets tumbled after Fed rate cut"). Modern X Explore tab.

Next up