Twitter Trending

Surface the top-10 trending topics — hashtags, phrases, named entities — from the firehose of all tweets, refreshed every few minutes, per geo and per user. The hard parts: approximate top-K over a high-velocity stream where storing exact counts for every term is impossible; distinguishing a trend from steady background noise (#love is always common, so it's not trending); and filtering spam + bot amplification before a coordinated campaign games the list. X/Twitter does ~500M tweets/day, ~6K/sec; trending updates for 400+ geos every ~5 minutes.

⚡ Core: Stream Top-K + Novelty + Anti-Spam~6K tweets/sec500M tweets/day400+ geo scopes5-min refresh

Requirements

Functional

Identify trending terms — hashtags, bigrams, named entities — from all tweets
Ranking by "trending-ness": how much more frequent NOW than baseline
Per-geo trends (country, city), per-language
Personalized trends based on user's follow graph + interests
Top-10 list refreshed every ~5 min per scope
Attach a tweet / summary preview to each trend (what's the story)

Non-Functional

End-to-end latency from tweet → trends recompute: < 1 min
Scale to ~6K tweets/sec sustained; peaks 20K+ for major events
400+ geo scopes × ~10 languages × personalization → millions of trend lists
Resilient to coordinated manipulation (bot rings, hashtag hijacking)
Memory-efficient: can't keep exact counts for every term (billions of unique terms)
Eventually consistent; approximations are fine if close to truth

Scale Estimation

Tweets / sec (avg)

~6K

500M/day ÷ 86,400; peaks 20K during live events

Tokens / tweet

~15

words + hashtags + bigrams after extraction; ~90K token events/sec

Unique terms / day

~50M

most appear <10 times; exact-count table would be ~50 GB/day minimum

Top-K size

K = 10

users see top-10 per list; compute top-100 internally for ranking + dedup

Refresh interval

5 min

trends don't need to flicker every second; 5-min is good UX

Geo scopes

~450

countries + major cities; each gets its own trends computation

API Design

GET/trends?woeid=23424977

Get trends for a Where-On-Earth ID (geo code). Returns {as_of, trends: [{name, volume, preview_tweet_id, url}], locations: [{country, ...}]}. Served from pre-computed cache; < 50 ms response.

GET/trends/personalized

Returns trends ranked by what's relevant to the caller's graph + interests. Requires auth; combines global trending with follow-graph signal.

GET/search?q=TREND&type=top

When user taps a trend: search for that term. Top tweets attached to the trend list come from this search run at trend-refresh time.

POST/trends/admin/block-term internal

Manually block a term from trending (policy violations, hate speech, spam). Small escape hatch layered on top of ML-based filtering.

GET/trends/available

List all WOEID locations that have trend data. Used by clients to populate the geo picker.

Architecture

A streaming pipeline. Tweets flow through Kafka; a feature extractor emits term events; a sketch service maintains approximate counts per (term, geo, window); a trend ranker runs every few minutes to compute top-K per geo, filter spam, pick a preview tweet, and write to the serving cache. Clients read from the cache only — no expensive queries on the hot path.

Streaming Pipeline — Firehose to Trends SVG

Request Flow — Step Through

Firehose · all tweets→Extractor · tokens + entities→Spam filter · bot-score weight→CMS + top-K · per (term,geo,win)→EMA baseline · past 7d→Ranker · current/baseline→Cache · top-10 per geo

Click Next Step to walk through the request flow.

Deep Dive — Top-K on a Firehose + "Trending" as Deviation

Naive top-K: keep a dict term→count; at refresh time, sort and take top 10. Breaks immediately — at 90K term-events/sec, the dict is 50M+ unique keys/day; sorting 50M entries per geo per minute is untenable.

Count-Min Sketch + heavy-hitter tracker. A Count-Min Sketch is a probabilistic counter: constant memory, approximate counts with small over-estimation bias. Paired with a bounded min-heap of top-K heavy hitters, it gives us top-K in fixed memory.

On each term event: increment CMS for (term, geo, window) → get approximate count. If count > current min-heap threshold, add/update in heap.
Memory footprint: a CMS with 4 hash functions × 1M counters ≈ 16 MB per (geo, window). Acceptable for hundreds of geos.
Window management: sliding 5-minute window. Old sketches age out as new ones accumulate.

"Trending" means deviation from baseline, not raw volume. #love is frequent every day — it's not news. A new spike of a previously-rare term IS news. Rank by:

trend_score(term, geo) = current_rate(term, geo) / ema_7day_rate(term, geo)

This EMA (exponentially-weighted moving average) is the baseline — cached per (term, geo) for frequent terms, defaulted for never-before-seen ones. The ratio metric surfaces true spikes, not the ambient background.

Trending Score Computation Mermaid

flowchart LR T[Tweet] --> X[Extractor
tokens+entities] X --> S[Spam filter
bot score] S -->|pass| C[CMS counter
5-min window] C --> H[Heavy-hitter
min-heap] H --> R[Ranker
current/baseline] B[Baseline EMA
past 7 days] --> R R --> D[Dedup & filter
policy block] D --> P[Preview picker
top tweet / LLM] P --> K[Trend cache
Redis per geo]

Spam / bot filtering. Bots can trivially game raw counts. Each user has a user reputation score (follower ratio, age, engagement authenticity signals); a tweet's contribution is weighted. Additionally, term events are deduped by near-hash (prevent copy-paste spam) and clustered by author — if 100 tweets come from 10 accounts with follow-overlap = 1.0, that's a bot ring; down-weight drastically.

Dedup + entity resolution. "#BarackObama", "#Obama", "Barack Obama" should merge. Entity resolution via a lightweight NLP pipeline (knowledge-graph lookup + fuzzy matching) clusters surface forms into canonical trends. Otherwise the top-10 is noisy and overlapping.

Preview pick. Once ranker decides a term is trending, it runs a quick search for recent high-engagement tweets matching that term; picks one as the preview. Modern systems (2024+) use an LLM to write a 1-line summary ("Markets tumbled after the Fed's rate announcement") instead of just picking a tweet.

Interview answer

"Firehose of tweets → Kafka. A feature extractor emits per-term events into a second topic. Each event hits a Count-Min Sketch per (term, geo, 5-min window) with a min-heap tracking top-100 heavy hitters. Every 5 min, a ranker computes trending score = current_rate / 7-day EMA baseline, filters spam via weighted user reputation, dedupes entity surface forms, picks a preview tweet (or LLM summary), and writes top-10 per geo to a Redis cache. Clients read from cache only."

Tradeoffs & Design Choices

Approximate (CMS) vs exact counts. Exact counts require 10–100× more memory for no meaningful accuracy gain in top-K. CMS's 1–5% over-estimation is invisible at top-K granularity.
Raw frequency vs deviation ranking. Raw frequency = "#love always wins." Deviation (current / baseline) is what makes trends feel like news. But deviation is noisy for low-count terms; floor the rate to avoid "brand new term with 10 tweets beats viral trend with 100K" false positives.
5-min windows vs 1-min or 1-hour. 5 min is the UX sweet spot: responsive but stable. 1 min causes flicker; 1 hour misses breaking news. Keep multiple window sizes if needed for different products.
Geo detection. IP-geo for anonymous users; profile location for logged-in. Both are unreliable. Mitigation: use multiple signals + cluster by language; don't show "#MelbourneCupDay" globally.
Centralized ranker vs distributed. Ranking every geo from one process is simpler; distributing by geo shards is necessary at planet scale. Each geo computes its own top-K; a thin aggregator handles global/cross-geo "world" trends.
Human-in-the-loop. ML alone isn't enough. Teams maintain deny-lists for slurs, false positives, obvious manipulation. Be explicit about this in interviews — pretending ML solves content safety is naive.

Failure Modes

🤖

Coordinated hashtag campaign

Bot ring pushes #FakeOutrage to trending with 10K tweets from 50 bot accounts over 10 min.

→ Mitigation: author-diversity metric — require trend to have >N unique non-correlated authors. Low-reputation accounts weighted 10× less. Suspected coordinated accounts clustered by graph embedding.

📉

Trend-ranker pipeline falls behind

Kafka consumer lag grows during a major event (World Cup goal). Trends show 30 min stale.

→ Mitigation: horizontally scale consumer workers; partition by geo so no single partition bottlenecks; alerting on pipeline lag triggers auto-scale.

🔄

Zombie trends from EMA staleness

A term spikes, baseline EMA catches up, trend_score → 1, but term is still in top-K because current rate is still high.

→ Mitigation: momentum decay — subtract penalty for trends already shown >N minutes. Short-circuit trends with no growth (d/dt current_rate < 0) after some time.

🧹

Slur or policy-violation term trends

A horrific term trends due to news coverage or targeted campaign. Platform liability + safety problem.

→ Mitigation: admin deny-list; ML classifier for hate/violence/etc; trend results always pass through a safety filter before cache write; ops team on-call to manually block edge cases.

🌍

Geo mis-attribution

Most tweets lack precise geo. IP-based detection is wrong for VPN users. Mumbai-specific trend shows up in Bangalore.

→ Mitigation: combine IP + user-profile location + language heuristics; default to country-level when city confidence is low; show "Country trends" when precise geo is unknown.

📸

Preview tweet becomes inappropriate

Chosen preview tweet later violates policy (edited, deleted, reported). Users see a now-stale or offensive snippet.

→ Mitigation: preview has TTL; regenerated at next refresh; deletion events invalidate cache immediately. Fallback to LLM-generated neutral summary when no high-quality tweet found.

Interview Tips

Lead with streaming top-K + CMS. Candidates who say "store counts in a DB and sort" fail here. The probabilistic data structure answer is expected.
Clarify "trending" early. Deviation from baseline, not raw volume. This single sentence separates good from great answers.
Anti-spam / anti-bot is load-bearing. Often skipped. Mention user-reputation weighting + author-diversity constraint.
Geo scoping + personalization are separate concerns. Geo is pre-compute; personalization is runtime-rerank on top of geo's top-K. Don't mush them together.
LLM-generated trend summaries is modern. Shows you know the 2024+ state of the art; a generic trend-picking design sounds dated.

Evolution

MVP — exact counts in Redis

Hashmap of term → count with per-minute buckets. Sort + take top 10. Works to ~10M tweets/day.

Count-Min Sketch + top-K heap

Switch to probabilistic counts. Handle ~100M+ unique terms/day in fixed memory. Per-geo sketches.

Deviation-based ranking + EMA baseline

"Trending" = current / baseline. No more "#love tops every day." 7-day EMA cached per (term, geo).

Spam filtering + entity resolution

User-reputation-weighted counts. Bot-ring detection via graph clustering. Entity merging for surface-form variants.

Personalized + LLM summaries

Trends reranked per user based on follow graph and interests. LLM generates 1-line description per trend ("Markets tumbled after Fed rate cut"). Modern X Explore tab.

📺

References & Videos

Design Twitter Trending Topics

Gaurav Sen · 15 min

Top-K Heavy Hitters — Streaming Algorithms

ByteByteGo · 8 min

Building a Complete Tweet Index

Next up

Twitter / News Feed

Same firehose as input; different output product

Read →

PROBLEM

Leaderboard

Top-K ranking with different score function

Read →

Twitter Trending

Requirements

Scale Estimation

API Design

Architecture

Deep Dive — Top-K on a Firehose + "Trending" as Deviation

Tradeoffs & Design Choices

Failure Modes

Interview Tips

Similar Problems

Twitter / News Feed

Leaderboard

Count Active Users

Recommendation Algorithm

Search Engine

Evolution

MVP — exact counts in Redis

Count-Min Sketch + top-K heap

Deviation-based ranking + EMA baseline

Spam filtering + entity resolution

Personalized + LLM summaries

References & Videos

Twitter / News Feed

Leaderboard