A web-scale search engine — crawl the public web, build an inverted index, serve queries in under 200 ms with high-quality ranked results. Google handles ~100K queries per second over an index of hundreds of billions of pages, ~400 PB of data.
The hard parts: an inverted index that fits across thousands of machines, a query serving pipeline that aggregates shard responses in milliseconds, and a ranking system that blends lexical match, link authority, and learned relevance.
⚡ Core: Inverted Index + Ranking~100K QPS< 200 ms p99100B+ pages indexedCrawl → Index → Serve pipeline
02
Requirements
Functional
Crawl the public web continuously; discover new pages, refresh existing
Build an inverted index: term → docs containing that term
Main search endpoint. Returns { results: [{url, title, snippet, rank, cached_link}], total_hits, spell_suggestion }. Latency budget: 150 ms shard query + 30 ms aggregation + 20 ms snippet rendering.
GET/suggest?q=PREFIX
Autocomplete. Served from a separate trie-based suggestion service for < 30 ms p99. Completions come from query logs, not the main index.
GET/images?q=TERM
Image search. Separate vertical — different index (CLIP embeddings + text), different ranking, served by a parallel pipeline.
POST/crawler/submit-urlinternal
Sitemap / webmaster API for site owners to hint "crawl this URL." Goes into the crawl frontier priority queue.
POST/click-feedbackinternal
Logs user clicks + dwell time per (query, url) pair. This is the ranking model's training signal. ~9B clicks/day is a massive ML dataset.
05
Architecture
Three independent pipelines: crawl (continuous), index build (periodic batch), query serving (real-time). They communicate via large shared data stores — the crawled page corpus and the inverted index.
Crawl → Index → Serve pipelineSVG
Request Flow — Step Through
Query · user types→Parser · normalize + expand→Cache · query-result LRU→Aggregator · scatter to shards→Index shards · BM25 local top-100→Reranker · BERT top-1000 → 10→Snippets · render + return
Click Next Step to walk through the request flow.
06
Deep Dive — Inverted Index + Query Serving
An inverted index maps each term to the list of documents containing it — the posting list. Query "climate change" becomes: fetch posting list for "climate", fetch for "change", intersect, score, return top-K. The whole design is shaped around making this fast at 100K QPS over 100B pages.
Sharding choice. Two strategies:
Document-sharded (Google's choice): each shard holds a subset of docs + the posting lists for those docs. A query scatter-gathers to every shard; each returns local top-K; aggregator merges to global top-K. Scales well to many QPS; wastes some work because every query hits every shard.
Term-sharded: each shard owns a subset of terms' full posting lists. Query routes only to the shards holding the queried terms. Very efficient per query — but hot terms ("the", "COVID") create pathological load; rebalancing is painful. Used for specialty verticals only.
Query pipeline, in order of execution:
Parse + normalize. Lowercase, stemming, stop-word handling, query expansion (synonyms, abbreviations). Spell-correction against a dictionary + query-log bigram model.
Scatter to all index shards. Aggregator sends query to ~1000 shards in parallel. Each shard returns its local top-100 with scores.
Shard-local scoring (first-pass ranker). Classic BM25 or similar. Cheap; runs over millions of candidate docs per shard.
Aggregator merges to top-1000. Heap-merge, take global top-1000 by BM25.
Reranker (second-pass). A heavy ML model (e.g., BERT-based cross-encoder) scores the top-1000 with query+doc pairs. 100× more expensive per doc but only on a small set.
Snippet generation. Extract sentence(s) best matching the query from each of the top-10 docs. This is lightweight but user-visible; lives on the critical path.
Query latency budget breakdownMermaid
sequenceDiagram
participant U as User
participant AG as Aggregator
participant S1 as Shard 1..N
participant RR as Reranker
participant RS as Response
U->>AG: query "climate change 2026"
Note over AG: parse + expand ~5 ms
AG->>S1: scatter to ~1000 shards
S1-->>AG: shard top-100 + BM25 ~50 ms (parallel)
Note over AG: merge to global top-1000 ~10 ms
AG->>RR: top-1000 with query
RR-->>AG: top-10 ML-reranked ~80 ms (BERT batched)
AG->>RS: snippet gen + render ~20 ms
RS-->>U: results total ~165 ms p50
Caching. A massive query cache sits in front of the whole pipeline. The long tail is huge, but the head is tiny — a surprising fraction of queries are repeats within minutes. A simple LRU on normalized query strings serves ~30–40% of traffic without touching the index at all. Result cache TTL is short (minutes) to preserve freshness.
Ranking signals. The modern ranker combines:
Lexical (BM25). Term frequency × inverse document frequency. Still the baseline.
Link authority (PageRank variants). Computed offline weekly over the web graph.
User signals. Click-through rate, dwell time, bounce rate per (query, doc) pair — trained into the reranker.
Semantic relevance (neural). Dense vector similarity between query and doc embeddings. Retrieves semantically-matching docs that don't share exact terms. Google's "neural matching" + BERT for reranking.
Interview answer
"Document-sharded inverted index across ~1000 shards. Query aggregator scatters to all shards in parallel, each returns local top-100 using BM25, aggregator merges to top-1000, then a BERT-based cross-encoder reranks top-1000 down to the final top-10. Snippets are generated on the critical path. A query result cache absorbs ~30% of traffic. PageRank runs offline weekly. Budget: 150 ms shard lookup + 80 ms rerank + 20 ms snippet = under 250 ms p99."
07
Tradeoffs & Design Choices
Scatter to all shards vs smart routing. Scatter wastes work but is resilient and uniform. Routing only to "relevant" shards saves work but requires shard-term statistics everywhere. Google uses scatter — CPU is cheap relative to the ops complexity of routing.
Two-stage ranking. BM25 is cheap but dumb; BERT is smart but 100× slower. Two-stage ranking (cheap over millions → expensive over thousands) is the standard modern answer. Single-stage BERT over all docs would be 1000× too slow.
Batch index rebuild vs real-time updates. Full index is rebuilt nightly via MapReduce. Fresh content (news, social) goes into a small "live index" served in parallel and merged at query time. Two-index architecture hits both scale and freshness.
Inverted index in RAM vs on disk. Hot posting lists in RAM (top N terms cover most queries); cold on SSD. Disk-backed portions use skip pointers + delta-compressed posting lists to stream fast.
Personalization. Pure relevance is simpler; personalization improves user satisfaction but creates filter-bubble complaints and makes ranking harder to reason about. Google personalizes lightly (geo, recent queries, signed-in history).
08
Failure Modes
🐢
Straggler shard blows the p99
Aggregator waits for the slowest of 1000 shards. Even with fast average, tail latency dominates: p99(slowest) across 1000 ≈ very bad.
→ Mitigation: hedged requests — after 95 ms on a shard, send a second request to a replica; use whichever returns first. Also: request cancellation; ignore shards that miss the 150 ms budget and serve the partial result.
🕷️
Crawler stuck on infinite URL loops
Calendar sites with "?date=next" links, search-result pages linking to more search-results → infinite crawl frontier.
→ Mitigation: URL normalization, SimHash of rendered content, per-host budget caps, hop-count limits, ML spam classifier on URL patterns.
🌊
News event doubles QPS in 5 minutes
Earthquake happens; queries spike 3× on "earthquake" + local queries. Aggregators saturate.
→ Mitigation: pre-provisioned headroom, autoscaler for aggregator tier, aggressive result-cache TTL for trending queries (thundering herd prevention), degraded mode that skips the reranker under load.
📅
Stale index after big news
Fresh page published about breaking news; doesn't show up in results for hours because batch index rebuilds on daily cadence.
→ Mitigation: real-time freshness layer — separate inverted index fed from a high-priority crawl bucket, queryable within minutes. Merged with the main index at aggregator time.
🛡️
SEO spam / adversarial content
Sites keyword-stuff to game BM25 / inject invisible backlinks for PageRank. Index quality degrades.
→ Mitigation: spam classifier in crawl pipeline rejects obvious spam; ranking model trained on user engagement naturally demotes low-quality content; manual review for egregious cases ("demote domain").
🔒
Query log privacy incident
Query logs are ranking gold but massively sensitive. Leak = disaster.
→ Mitigation: per-user anonymization after N days, aggregate statistics only at query-signal level, strict internal access controls, differential-privacy noise on long-tail query stats.
⚠
Anti-patterns
🚫
Grep-style linear scan over all docs
100B pages × 100 KB = you're done.
✓ Better: Inverted index per term; posting lists; scatter-gather across document-sharded index.
🚫
One giant inverted index on one machine
Doesn't fit; one query's miss kills everyone.
✓ Better: Document-sharded across ~1000 shards; each returns local top-K; aggregator merges.
🚫
Rerank all 100B results with BERT for every query
BERT at 100K QPS over 100B docs is impossible.
✓ Better: Two-stage: cheap BM25 over millions → heavy ML rerank over top-1000.
09
Interview Tips
Name the three pipelines. Crawl, index build, query serving. Many candidates conflate them or only discuss query serving. Show you know crawl is a separate beast.
Document-sharding is the default answer. If you say "term-sharded," be ready to defend the hot-term problem. Google uses document-sharded; most interviewers expect that.
Two-stage ranking. Explicitly call out: cheap first-pass over all candidates → expensive neural reranker over a small set. This is the modern pattern and shows you're current.
Query cache answers "how do you handle 100K QPS?" A big fraction of queries are repeats; a simple LRU slashes load by 30%+. Mention it early.
Fresh index + main index. If asked "how do you handle breaking news?" — two-index architecture is the answer, not "rebuild more often."
Crawl ~10M docs; build an on-disk inverted index; one HTTP server handles queries. Works for a niche vertical (e.g., "search this intranet"). Not the web.
2
Document-sharded index + aggregator
Shard by doc_id hash; scatter-gather across shards. BM25 on each shard. This is pre-PageRank search (AltaVista era).
3
PageRank-enhanced ranking
Batch-compute PageRank over the link graph; factor authority into ranking. This is the original Google advantage (1998).
4
Learning-to-rank with click signals
User click + dwell data trains a rerank model. Ranker is now data-driven, not just heuristic. Google's RankBrain era (2015).
5
Neural retrieval + LLM answers
Dense vector retrieval for semantic matching; BERT/transformer rerankers. LLM-generated direct answers ("AI Overviews") on top of the ranked list. Modern Google (2023+).