Search Autocomplete

A typeahead system that returns ranked suggestions on every keystroke — at sub-100ms latency, for billions of daily searches. The challenge isn't finding matches. It's pre-organizing data so matches are already waiting.

⚡ Core: Trie + Top-K caching5B searches/dayRead-heavy<100ms p99Eventually consistentRegional Tries
02

Requirements

Functional
  • Return top 5–8 suggestions for any prefix as the user types
  • Suggestions ranked by global search frequency
  • Personal history boosts relevant suggestions for logged-in users
  • Trending terms surface within minutes of a breaking event
  • Regional relevance — suggestions differ by geography
Non-Functional
  • Latency <100ms p99 end-to-end per keystroke
  • 99.99% availability — suggestions are user-facing
  • Eventual consistency is fine — stale suggestions are harmless
  • Horizontally scalable reads — Trie servers are stateless replicas
  • Trie freshness: full rebuild every 6h, hot patches every ~60s
Scope Decision

We exclude spell correction and semantic search — those are separate systems that run before the Trie query, not part of the Trie itself. Typeahead is strictly prefix matching + ranking.

03

Scale Estimation

MetricCalculationResult
Daily searches Google benchmark 5B / day
Keystroke events / day 5B searches × 20 chars avg 100B / day
Raw RPS (before debounce) 100B ÷ 86,400s ~1M RPS
Actual RPS (after debounce) 1M ÷ 5–10× debounce reduction ~100–200K RPS
Unique queries in Trie Top popular queries only ~10M
Trie size in memory 100M nodes × 150B (top-5 suggestions each) ~15 GB
Global Trie servers Multiple replicas per region 50–100 servers
Key Insight — Debouncing

The client waits 100–150ms after the last keystroke before firing a request. Fast typists generate a burst of keystrokes in 300ms — without debouncing, 5 requests fire but only the last one matters. Debouncing alone reduces server load by 5–10× at zero cost.

Key Insight — Trie fits in RAM

15 GB per Trie is large but fits comfortably on a single modern server. This is why the Trie lives entirely in memory — no disk reads on the hot path. The entire read path is pure in-memory lookup.

04

API Design

GET /v1/suggestions Fetch top-K suggestions for a prefix
// Request — fires on every (debounced) keystroke GET /v1/suggestions?q=app&limit=8®ion=us&uid=u_abc123 // Response 200 — served from Trie in <5ms { "query": "app", "suggestions": [ { "text": "apple", "score": 500000000, "source": "global" }, { "text": "app store", "score": 320000000, "source": "global" }, { "text": "apple music", "score": 180000000, "source": "personal" } ], "latency_ms": 4 }
POST /v1/search-event Record a completed search (async, fire-and-forget)
// Fires when user hits Enter / selects a result — NOT on every keystroke { "query": "apple music", "uid": "u_abc123", "region": "us", "ts": 1704067200 } // Response 202 Accepted — fire-and-forget, written to Kafka { "accepted": true }
Design Note

The GET and POST endpoints are completely decoupled. Reads hit the Trie servers. Writes go to Kafka. They never touch the same system. This is what makes the architecture scalable — read path and write path scale independently.

05

High-Level Architecture

Architecture — Read Path + Write Path SVG Diagram
READ PATH WRITE PATH CLIENT Browser debounce 150ms GET /suggestions EDGE CDN / Cache Regional PoPs cache miss INFRA Load Balancer Round-robin CORE COMPONENT Trie Servers Trie in memory Top-K at each node PERSONAL User History Redis per-user merge+rank CLIENT Search Event on Enter key QUEUE Kafka Async ingestion STORAGE Cassandra Raw search logs STREAMING Flink BATCH Spark Job Every 6 hours HOT PATCH Trending BUILDER Trie Builder Blue/Green swap swap Trie Sync Async / periodic Trie swap
CDN / Edge Cache

Popular prefix queries are cached at edge PoPs close to users. "app" is searched billions of times — the CDN absorbs most of this traffic before it even reaches the Trie servers. Cache TTL: 60 seconds.

Trie Servers

The entire Trie lives in memory (~15 GB). Each server is a read-only replica — no writes, no coordination. A query walks 3–20 edges and returns pre-computed top-K. Average response: <5ms.

Kafka + Cassandra

Every completed search fires an async event into Kafka — fire-and-forget, no blocking the user. Kafka consumers write to Cassandra for durable storage. Billions of rows per day, pure append workload.

Spark + Flink

Spark runs a batch job every 6 hours — reads Cassandra, computes frequencies, rebuilds the full Trie. Flink runs continuously — detects sudden spikes (breaking news) and injects hot patches into the live Trie.

06

Deep Dive — The Trie

The Core Insight

A Trie doesn't find matches by scanning — it skips entire branches that can't possibly match. To find all words starting with "app", you walk three edges (a→p→p) and everything beneath that node is your answer. The structure of the tree does the filtering.

Interactive Trie — type a prefix
Trie structure
Top suggestions returned
Sequence — Cache Hit vs Cache Miss Mermaid.js
sequenceDiagram participant C as Client participant CDN as CDN Edge participant LB as Load Balancer participant T as Trie Server participant R as User History C->>CDN: GET /suggestions?q=app alt CDN Cache Hit (popular prefix) CDN-->>C: 200 OK — cached suggestions (< 5ms) else CDN Cache Miss CDN->>LB: Forward request LB->>T: Route to Trie server T->>T: Walk Trie: root→a→p→p T->>T: Read top-K from node cache T->>R: Fetch user search history R-->>T: Recent queries for this user T->>T: Merge global + personal, re-rank T-->>LB: Suggestions (< 5ms) LB-->>CDN: Response CDN->>CDN: Cache for 60s CDN-->>C: 200 OK end

The Trie stores top-K suggestions pre-computed at every node. This is the key optimization. Without it, a query for "app" would need to traverse every descendant node — potentially thousands — collect their scores, sort them, and return the top 5. At 100K RPS, that's catastrophically slow.

With top-K caching, arriving at the "app" node means the answer is already sitting there. The node contains ["apple", "app store", "application", "apple music", "applebees"] — pre-sorted, pre-computed. Return it immediately.

The cost: writes become expensive. When "ChatGPT" trends, every ancestor node must recompute its top-K — "c", "ch", "cha", "chat", "chatg", "chatgp", "chatgpt". This is why we never update the Trie in real-time. We rebuild it offline and swap it in.

Request Flow — Step Through
Client · debounce 150msCDN Edge · cache checkLoad Balancer · round-robinTrie Server · in-memoryUser History · Redis
Click Next Step to walk through the request flow.
08

Key Design Decisions & Tradeoffs

Chosen
Top-K cached at every node

Pre-compute and store the top 5–8 suggestions directly at each Trie node. Reads are O(prefix length) — just walk edges and return. No traversal, no sorting at query time.

✓ O(1) reads — best for 100K RPS
Alternative
Traverse at query time

Walk all descendants of the prefix node, collect scores, sort, return top-K. Simple to implement and update. But traversal cost is unbounded — "a" has millions of descendants.

✗ Too slow at scale
Chosen
Periodic batch rebuild (6h) + streaming hot patches

Batch for correctness and completeness. Streaming for recency on trending terms. Two tracks, each optimized for their job. Eventual consistency is acceptable — stale suggestions are harmless.

✓ Best of both worlds
Alternative
Real-time Trie updates

Update the Trie on every search event. Frequency changes propagate up all ancestor nodes instantly. But at 5B searches/day, this means constant write storms into a structure that serves 100K reads/second.

✗ Write-read contention is catastrophic
Chosen
Regional Tries

Separate Trie per geographic region, each trained on local search data. "Vi" in India suggests "Virat Kohli". "Vi" in Vietnam suggests "Vietnam Airlines". Culturally relevant + low latency.

✓ Relevance + latency
Alternative
Single Global Trie

One Trie for all users worldwide. Simpler to build and operate. But global popularity dominates regional taste, and serving from a single location means high latency for users far from the data center.

✗ Latency + poor regional relevance
Chosen
Client-side debouncing (150ms)

Fire the request only after the user pauses typing for 150ms. Fast typists generating 5 keystrokes in 300ms trigger 1 request, not 5. 5–10× reduction in server load at zero infrastructure cost.

✓ Free 10× scaling
Alternative
Request on every keystroke

Maximum freshness — suggestions update on literally every character. But most intermediate requests are wasted. By the time "ap" response arrives, the user has typed "app". 1M RPS vs 100K RPS.

✗ 10× unnecessary load
09

What Can Go Wrong

💥
Trie Server Crash

A Trie server holding the entire index in memory goes down. Every user routed to that instance gets no suggestions — a blank dropdown on every keystroke.

→ Fix: Multiple read replicas per region behind a load balancer. Trie is read-only — replicas are trivially consistent. Failed node is removed from rotation automatically.
🔨
Partial Trie Build Failure

The Spark job crashes halfway through a rebuild. You have a half-built Trie in the staging slot. If it gets swapped in, users get broken or missing suggestions for half the alphabet.

→ Fix: Never swap without a health check. Build job writes to staging only. A post-build validation test queries 1,000 known prefixes. If any fail, the old Trie stays live. Stale is better than broken.
📈
Cache Stampede on Trie Swap

Immediately after a swap, CDN caches are cold. All previously cached popular prefixes ("a", "th", "go") miss simultaneously. A wall of traffic hits the Trie servers at once.

→ Fix: Gradual regional rollout — swap one region at a time. Pre-warm the new Trie by replaying top-1000 prefixes before the swap. Monitor p99 latency and roll back if it spikes.
🤖
Artificial Trending Abuse

A bad actor runs a bot that searches "buy cheap watches" a million times. Without filtering, this pollutes the Trie — it surfaces for prefix "bu" globally, pushing out legitimate suggestions.

→ Fix: Rate-limit writes to Kafka by IP. Deduplicate searches from the same user within a time window. Anomaly detection in the Spark job flags unnatural frequency spikes (growth rate > 10× baseline triggers review).
⏱️
Kafka Lag Spike

A traffic surge causes Kafka consumers to fall behind. Raw search events pile up unprocessed. The next Spark run may miss recent data or read incomplete logs.

→ Fix: Kafka retains events durably for 7 days. Consumers catching up doesn't lose data — they just process it late. The next Trie rebuild will include everything. You lose recency, not correctness.

Anti-patterns

🚫
Scan full text of all suggestions per keystroke

Linear in corpus size; 50K QPS on 100M entries is dead.

✓ Better: Pre-built trie or aho-corasick index; sublinear lookup.
🚫
Hit your authoritative DB on every keystroke

Debounced or not, DB dies.

✓ Better: Dedicated suggestion service with in-memory trie; rebuilt from query logs offline.
🚫
Use generic search index (ElasticSearch) for autocomplete

Overkill for the latency + ranking needs.

✓ Better: Tiny dedicated index with heavy-hitter ranking from query logs.
10

Interview Tips

01

Start with the Trie, not the pipeline. Interviewers want to know you understand the core data structure. Explain why a Trie beats a sorted array for prefix lookups before touching infrastructure. The pipeline is supporting cast — the Trie is the star.

02

Name the write-propagation problem proactively. Say "storing top-K at each node makes reads O(1) but makes writes expensive — updating a single query's frequency requires walking all its ancestor nodes." This shows you understand the tradeoff, not just the optimization.

03

The classic follow-up: "How do you keep the Trie fresh?" This is where most candidates get stuck. Have a clear answer: batch rebuild every 6 hours for accuracy, Flink streaming for hot patches on trending terms. Two tracks. Explain why real-time updates would destroy the read path.

04

Mention debouncing early. It's a simple client-side trick that reduces server load by 10×. Saying this unprompted signals that you think about the full system — not just the backend. Interviewers love it because most candidates forget the client exists.

05

"How would you make it personal?" is a common extension. Answer: a lightweight per-user history index in Redis, blended with the global Trie results at query time. The global Trie stays unchanged — personalization is a separate layer on top, not baked into the Trie itself.

06

The Trie fits in RAM — say it explicitly. ~15 GB for 10M queries with top-K caching. This justifies the in-memory architecture and kills the question "why not query a database on every keystroke?" before it's asked.

12

How the Design Evolves

Phase 1 — Product / Prototype
In-memory Trie, single server, manual updates

Build the Trie from a static word list. Store it in memory on one server. Rebuild it manually when you want to add new terms. Ship in a week. This is right for most early products — keyboard autocomplete, internal search, form field suggestions.

Phase 2 — Growing Product (1M searches/day)
Add the data pipeline — Kafka → Cassandra → Spark

The Trie now learns from real user searches. A nightly Spark job rebuilds it from the previous day's logs. Add a CDN in front for the most popular prefixes. Most products live here their entire lives. Daily rebuild is fine at this scale.

Phase 3 — Scale (100M searches/day)
Regional Tries + Blue/Green swaps + Flink streaming

One global Trie can no longer serve all regions with low latency. Deploy regional Tries. Spark runs every 6 hours instead of daily. Add Flink for trending term detection. Blue/Green swap ensures zero-downtime Trie updates. This is the architecture we designed above.

Phase 4 — Google Scale (5B searches/day)
Custom data structures, ML ranking, per-user models

The Trie is augmented with ML-based ranking — not just frequency but click-through rate, session context, semantic similarity. Personalization becomes a full model per user, not just history lookup. Custom hardware for Trie storage. Almost no engineers operate here — but knowing it demonstrates architectural depth.

Next up