A typeahead system that returns ranked suggestions on every keystroke — at sub-100ms latency, for billions of daily searches. The challenge isn't finding matches. It's pre-organizing data so matches are already waiting.
Return top 5–8 suggestions for any prefix as the user types
Suggestions ranked by global search frequency
Personal history boosts relevant suggestions for logged-in users
Trending terms surface within minutes of a breaking event
Regional relevance — suggestions differ by geography
Non-Functional
Latency <100ms p99 end-to-end per keystroke
99.99% availability — suggestions are user-facing
Eventual consistency is fine — stale suggestions are harmless
Horizontally scalable reads — Trie servers are stateless replicas
Trie freshness: full rebuild every 6h, hot patches every ~60s
Scope Decision
We exclude spell correction and semantic search — those are separate systems that run before the Trie query, not part of the Trie itself. Typeahead is strictly prefix matching + ranking.
03
Scale Estimation
Metric
Calculation
Result
Daily searches
Google benchmark
5B / day
Keystroke events / day
5B searches × 20 chars avg
100B / day
Raw RPS (before debounce)
100B ÷ 86,400s
~1M RPS
Actual RPS (after debounce)
1M ÷ 5–10× debounce reduction
~100–200K RPS
Unique queries in Trie
Top popular queries only
~10M
Trie size in memory
100M nodes × 150B (top-5 suggestions each)
~15 GB
Global Trie servers
Multiple replicas per region
50–100 servers
Key Insight — Debouncing
The client waits 100–150ms after the last keystroke before firing a request. Fast typists generate a burst of keystrokes in 300ms — without debouncing, 5 requests fire but only the last one matters. Debouncing alone reduces server load by 5–10× at zero cost.
Key Insight — Trie fits in RAM
15 GB per Trie is large but fits comfortably on a single modern server. This is why the Trie lives entirely in memory — no disk reads on the hot path. The entire read path is pure in-memory lookup.
04
API Design
GET/v1/suggestionsFetch top-K suggestions for a prefix
// Request — fires on every (debounced) keystroke
GET /v1/suggestions?q=app&limit=8®ion=us&uid=u_abc123// Response 200 — served from Trie in <5ms
{
"query": "app",
"suggestions": [
{ "text": "apple", "score": 500000000, "source": "global" },
{ "text": "app store", "score": 320000000, "source": "global" },
{ "text": "apple music", "score": 180000000, "source": "personal" }
],
"latency_ms": 4
}
POST/v1/search-eventRecord a completed search (async, fire-and-forget)
// Fires when user hits Enter / selects a result — NOT on every keystroke
{ "query": "apple music", "uid": "u_abc123", "region": "us", "ts": 1704067200 }
// Response 202 Accepted — fire-and-forget, written to Kafka
{ "accepted": true }
Design Note
The GET and POST endpoints are completely decoupled. Reads hit the Trie servers. Writes go to Kafka. They never touch the same system. This is what makes the architecture scalable — read path and write path scale independently.
05
High-Level Architecture
Architecture — Read Path + Write PathSVG Diagram
CDN / Edge Cache
Popular prefix queries are cached at edge PoPs close to users. "app" is searched billions of times — the CDN absorbs most of this traffic before it even reaches the Trie servers. Cache TTL: 60 seconds.
Trie Servers
The entire Trie lives in memory (~15 GB). Each server is a read-only replica — no writes, no coordination. A query walks 3–20 edges and returns pre-computed top-K. Average response: <5ms.
Kafka + Cassandra
Every completed search fires an async event into Kafka — fire-and-forget, no blocking the user. Kafka consumers write to Cassandra for durable storage. Billions of rows per day, pure append workload.
Spark + Flink
Spark runs a batch job every 6 hours — reads Cassandra, computes frequencies, rebuilds the full Trie. Flink runs continuously — detects sudden spikes (breaking news) and injects hot patches into the live Trie.
06
Deep Dive — The Trie
The Core Insight
A Trie doesn't find matches by scanning — it skips entire branches that can't possibly match. To find all words starting with "app", you walk three edges (a→p→p) and everything beneath that node is your answer. The structure of the tree does the filtering.
Interactive Trie — type a prefix
Trie structure
Top suggestions returned
Sequence — Cache Hit vs Cache MissMermaid.js
sequenceDiagram
participant C as Client
participant CDN as CDN Edge
participant LB as Load Balancer
participant T as Trie Server
participant R as User History
C->>CDN: GET /suggestions?q=app
alt CDN Cache Hit (popular prefix)
CDN-->>C: 200 OK — cached suggestions (< 5ms)
else CDN Cache Miss
CDN->>LB: Forward request
LB->>T: Route to Trie server
T->>T: Walk Trie: root→a→p→p
T->>T: Read top-K from node cache
T->>R: Fetch user search history
R-->>T: Recent queries for this user
T->>T: Merge global + personal, re-rank
T-->>LB: Suggestions (< 5ms)
LB-->>CDN: Response
CDN->>CDN: Cache for 60s
CDN-->>C: 200 OK
end
The Trie stores top-K suggestions pre-computed at every node. This is the key optimization. Without it, a query for "app" would need to traverse every descendant node — potentially thousands — collect their scores, sort them, and return the top 5. At 100K RPS, that's catastrophically slow.
With top-K caching, arriving at the "app" node means the answer is already sitting there. The node contains ["apple", "app store", "application", "apple music", "applebees"] — pre-sorted, pre-computed. Return it immediately.
The cost: writes become expensive. When "ChatGPT" trends, every ancestor node must recompute its top-K — "c", "ch", "cha", "chat", "chatg", "chatgp", "chatgpt". This is why we never update the Trie in real-time. We rebuild it offline and swap it in.
Request Flow — Step Through
Client · debounce 150ms→CDN Edge · cache check→Load Balancer · round-robin→Trie Server · in-memory→User History · Redis
Click Next Step to walk through the request flow.
08
Key Design Decisions & Tradeoffs
Chosen
Top-K cached at every node
Pre-compute and store the top 5–8 suggestions directly at each Trie node. Reads are O(prefix length) — just walk edges and return. No traversal, no sorting at query time.
✓ O(1) reads — best for 100K RPS
Alternative
Traverse at query time
Walk all descendants of the prefix node, collect scores, sort, return top-K. Simple to implement and update. But traversal cost is unbounded — "a" has millions of descendants.
✗ Too slow at scale
Chosen
Periodic batch rebuild (6h) + streaming hot patches
Batch for correctness and completeness. Streaming for recency on trending terms. Two tracks, each optimized for their job. Eventual consistency is acceptable — stale suggestions are harmless.
✓ Best of both worlds
Alternative
Real-time Trie updates
Update the Trie on every search event. Frequency changes propagate up all ancestor nodes instantly. But at 5B searches/day, this means constant write storms into a structure that serves 100K reads/second.
✗ Write-read contention is catastrophic
Chosen
Regional Tries
Separate Trie per geographic region, each trained on local search data. "Vi" in India suggests "Virat Kohli". "Vi" in Vietnam suggests "Vietnam Airlines". Culturally relevant + low latency.
✓ Relevance + latency
Alternative
Single Global Trie
One Trie for all users worldwide. Simpler to build and operate. But global popularity dominates regional taste, and serving from a single location means high latency for users far from the data center.
✗ Latency + poor regional relevance
Chosen
Client-side debouncing (150ms)
Fire the request only after the user pauses typing for 150ms. Fast typists generating 5 keystrokes in 300ms trigger 1 request, not 5. 5–10× reduction in server load at zero infrastructure cost.
✓ Free 10× scaling
Alternative
Request on every keystroke
Maximum freshness — suggestions update on literally every character. But most intermediate requests are wasted. By the time "ap" response arrives, the user has typed "app". 1M RPS vs 100K RPS.
✗ 10× unnecessary load
09
What Can Go Wrong
💥
Trie Server Crash
A Trie server holding the entire index in memory goes down. Every user routed to that instance gets no suggestions — a blank dropdown on every keystroke.
→ Fix: Multiple read replicas per region behind a load balancer. Trie is read-only — replicas are trivially consistent. Failed node is removed from rotation automatically.
🔨
Partial Trie Build Failure
The Spark job crashes halfway through a rebuild. You have a half-built Trie in the staging slot. If it gets swapped in, users get broken or missing suggestions for half the alphabet.
→ Fix: Never swap without a health check. Build job writes to staging only. A post-build validation test queries 1,000 known prefixes. If any fail, the old Trie stays live. Stale is better than broken.
📈
Cache Stampede on Trie Swap
Immediately after a swap, CDN caches are cold. All previously cached popular prefixes ("a", "th", "go") miss simultaneously. A wall of traffic hits the Trie servers at once.
→ Fix: Gradual regional rollout — swap one region at a time. Pre-warm the new Trie by replaying top-1000 prefixes before the swap. Monitor p99 latency and roll back if it spikes.
🤖
Artificial Trending Abuse
A bad actor runs a bot that searches "buy cheap watches" a million times. Without filtering, this pollutes the Trie — it surfaces for prefix "bu" globally, pushing out legitimate suggestions.
→ Fix: Rate-limit writes to Kafka by IP. Deduplicate searches from the same user within a time window. Anomaly detection in the Spark job flags unnatural frequency spikes (growth rate > 10× baseline triggers review).
⏱️
Kafka Lag Spike
A traffic surge causes Kafka consumers to fall behind. Raw search events pile up unprocessed. The next Spark run may miss recent data or read incomplete logs.
→ Fix: Kafka retains events durably for 7 days. Consumers catching up doesn't lose data — they just process it late. The next Trie rebuild will include everything. You lose recency, not correctness.
⚠
Anti-patterns
🚫
Scan full text of all suggestions per keystroke
Linear in corpus size; 50K QPS on 100M entries is dead.
✓ Better: Pre-built trie or aho-corasick index; sublinear lookup.
🚫
Hit your authoritative DB on every keystroke
Debounced or not, DB dies.
✓ Better: Dedicated suggestion service with in-memory trie; rebuilt from query logs offline.
🚫
Use generic search index (ElasticSearch) for autocomplete
Overkill for the latency + ranking needs.
✓ Better: Tiny dedicated index with heavy-hitter ranking from query logs.
10
Interview Tips
01
Start with the Trie, not the pipeline. Interviewers want to know you understand the core data structure. Explain why a Trie beats a sorted array for prefix lookups before touching infrastructure. The pipeline is supporting cast — the Trie is the star.
02
Name the write-propagation problem proactively. Say "storing top-K at each node makes reads O(1) but makes writes expensive — updating a single query's frequency requires walking all its ancestor nodes." This shows you understand the tradeoff, not just the optimization.
03
The classic follow-up: "How do you keep the Trie fresh?" This is where most candidates get stuck. Have a clear answer: batch rebuild every 6 hours for accuracy, Flink streaming for hot patches on trending terms. Two tracks. Explain why real-time updates would destroy the read path.
04
Mention debouncing early. It's a simple client-side trick that reduces server load by 10×. Saying this unprompted signals that you think about the full system — not just the backend. Interviewers love it because most candidates forget the client exists.
05
"How would you make it personal?" is a common extension. Answer: a lightweight per-user history index in Redis, blended with the global Trie results at query time. The global Trie stays unchanged — personalization is a separate layer on top, not baked into the Trie itself.
06
The Trie fits in RAM — say it explicitly.~15 GB for 10M queries with top-K caching. This justifies the in-memory architecture and kills the question "why not query a database on every keystroke?" before it's asked.
11
Similar Problems
Typeahead is really two patterns combined: prefix tree lookup + top-K aggregation pipeline. These problems share one or both.
Build the Trie from a static word list. Store it in memory on one server. Rebuild it manually when you want to add new terms. Ship in a week. This is right for most early products — keyboard autocomplete, internal search, form field suggestions.
Phase 2 — Growing Product (1M searches/day)
Add the data pipeline — Kafka → Cassandra → Spark
The Trie now learns from real user searches. A nightly Spark job rebuilds it from the previous day's logs. Add a CDN in front for the most popular prefixes. Most products live here their entire lives. Daily rebuild is fine at this scale.
One global Trie can no longer serve all regions with low latency. Deploy regional Tries. Spark runs every 6 hours instead of daily. Add Flink for trending term detection. Blue/Green swap ensures zero-downtime Trie updates. This is the architecture we designed above.
Phase 4 — Google Scale (5B searches/day)
Custom data structures, ML ranking, per-user models
The Trie is augmented with ML-based ranking — not just frequency but click-through rate, session context, semantic similarity. Personalization becomes a full model per user, not just history lookup. Custom hardware for Trie storage. Almost no engineers operate here — but knowing it demonstrates architectural depth.