Search Engine

A web-scale search engine — crawl the public web, build an inverted index, serve queries in under 200 ms with high-quality ranked results. Google handles ~100K queries per second over an index of hundreds of billions of pages, ~400 PB of data. The hard parts: an inverted index that fits across thousands of machines, a query serving pipeline that aggregates shard responses in milliseconds, and a ranking system that blends lexical match, link authority, and learned relevance.

⚡ Core: Inverted Index + Ranking~100K QPS< 200 ms p99100B+ pages indexedCrawl → Index → Serve pipeline
02

Requirements

Functional
  • Crawl the public web continuously; discover new pages, refresh existing
  • Build an inverted index: term → docs containing that term
  • Serve keyword queries, phrase queries, filters (site:, date, language)
  • Rank results by relevance — lexical score + authority + freshness + personalization
  • Return top-10 results with snippets + cached links in < 200 ms
  • Autocomplete / spell-correct / "did you mean"
Non-Functional
  • < 200 ms p99 query latency end-to-end
  • ~100K QPS sustained; burst 2× during news events
  • Index covers 100B+ pages; recrawl important pages every few minutes
  • 99.99% query availability — users notice outages in seconds
  • Eventual consistency: freshly-crawled pages appear in index within minutes-to-hours
  • Geo-distributed: queries served from nearest region
03

Scale Estimation

Pages indexed
~100B
estimated Google index; raw HTML ~100 KB avg = 10 PB raw → ~400 PB stored with all derivatives
Queries / sec
~100K
Google handles ~99K/sec global avg; ~9B/day
Unique terms
~100M
after normalization; inverted index has 100M posting lists
Posting list size
~100 TB
compressed; needs ~1000-way shard to fit in RAM-tier across cluster
Crawl rate
~100K pages/sec
~10B pages/day; polite crawl respects robots.txt per-host
Query fan-out
~1000 shards
each query scatter-gathers across every index shard; aggregator merges top-K
04

API Design

GET/search?q=TERM&start=0&count=10&lang=en&site=github.com

Main search endpoint. Returns { results: [{url, title, snippet, rank, cached_link}], total_hits, spell_suggestion }. Latency budget: 150 ms shard query + 30 ms aggregation + 20 ms snippet rendering.

GET/suggest?q=PREFIX

Autocomplete. Served from a separate trie-based suggestion service for < 30 ms p99. Completions come from query logs, not the main index.

GET/images?q=TERM

Image search. Separate vertical — different index (CLIP embeddings + text), different ranking, served by a parallel pipeline.

POST/crawler/submit-url internal

Sitemap / webmaster API for site owners to hint "crawl this URL." Goes into the crawl frontier priority queue.

POST/click-feedback internal

Logs user clicks + dwell time per (query, url) pair. This is the ranking model's training signal. ~9B clicks/day is a massive ML dataset.

05

Architecture

Three independent pipelines: crawl (continuous), index build (periodic batch), query serving (real-time). They communicate via large shared data stores — the crawled page corpus and the inverted index.

Crawl → Index → Serve pipeline SVG
Crawl pipeline URL Frontier priority queue Fetchers HTTP pool Parser DOM + links Dedup SimHash Raw corpus (Bigtable) ~10 PB HTML + metadata Index pipeline (batch) MapReduce indexer tokenize + invert PageRank job weekly; Pregel Ranker trainer LTR / BERT Snippet gen sentence scoring Inverted index (sharded) ~1000 shards × ~100 GB each Query serving (hot path) User query Query parser normalize + expand Aggregator scatter/gather Shard 1 Shard 2 … N Reranker BERT / LTR Response snippets + render
Request Flow — Step Through
Query · user typesParser · normalize + expandCache · query-result LRUAggregator · scatter to shardsIndex shards · BM25 local top-100Reranker · BERT top-1000 → 10Snippets · render + return
Click Next Step to walk through the request flow.
06

Deep Dive — Inverted Index + Query Serving

An inverted index maps each term to the list of documents containing it — the posting list. Query "climate change" becomes: fetch posting list for "climate", fetch for "change", intersect, score, return top-K. The whole design is shaped around making this fast at 100K QPS over 100B pages.

Sharding choice. Two strategies:

  • Document-sharded (Google's choice): each shard holds a subset of docs + the posting lists for those docs. A query scatter-gathers to every shard; each returns local top-K; aggregator merges to global top-K. Scales well to many QPS; wastes some work because every query hits every shard.
  • Term-sharded: each shard owns a subset of terms' full posting lists. Query routes only to the shards holding the queried terms. Very efficient per query — but hot terms ("the", "COVID") create pathological load; rebalancing is painful. Used for specialty verticals only.

Query pipeline, in order of execution:

  1. Parse + normalize. Lowercase, stemming, stop-word handling, query expansion (synonyms, abbreviations). Spell-correction against a dictionary + query-log bigram model.
  2. Scatter to all index shards. Aggregator sends query to ~1000 shards in parallel. Each shard returns its local top-100 with scores.
  3. Shard-local scoring (first-pass ranker). Classic BM25 or similar. Cheap; runs over millions of candidate docs per shard.
  4. Aggregator merges to top-1000. Heap-merge, take global top-1000 by BM25.
  5. Reranker (second-pass). A heavy ML model (e.g., BERT-based cross-encoder) scores the top-1000 with query+doc pairs. 100× more expensive per doc but only on a small set.
  6. Snippet generation. Extract sentence(s) best matching the query from each of the top-10 docs. This is lightweight but user-visible; lives on the critical path.
Query latency budget breakdown Mermaid
sequenceDiagram participant U as User participant AG as Aggregator participant S1 as Shard 1..N participant RR as Reranker participant RS as Response U->>AG: query "climate change 2026" Note over AG: parse + expand
~5 ms AG->>S1: scatter to ~1000 shards S1-->>AG: shard top-100 + BM25
~50 ms (parallel) Note over AG: merge to global top-1000
~10 ms AG->>RR: top-1000 with query RR-->>AG: top-10 ML-reranked
~80 ms (BERT batched) AG->>RS: snippet gen + render
~20 ms RS-->>U: results
total ~165 ms p50

Caching. A massive query cache sits in front of the whole pipeline. The long tail is huge, but the head is tiny — a surprising fraction of queries are repeats within minutes. A simple LRU on normalized query strings serves ~30–40% of traffic without touching the index at all. Result cache TTL is short (minutes) to preserve freshness.

Ranking signals. The modern ranker combines:

  • Lexical (BM25). Term frequency × inverse document frequency. Still the baseline.
  • Link authority (PageRank variants). Computed offline weekly over the web graph.
  • Freshness. News-y queries favor recent docs; evergreen queries don't.
  • User signals. Click-through rate, dwell time, bounce rate per (query, doc) pair — trained into the reranker.
  • Semantic relevance (neural). Dense vector similarity between query and doc embeddings. Retrieves semantically-matching docs that don't share exact terms. Google's "neural matching" + BERT for reranking.
Interview answer

"Document-sharded inverted index across ~1000 shards. Query aggregator scatters to all shards in parallel, each returns local top-100 using BM25, aggregator merges to top-1000, then a BERT-based cross-encoder reranks top-1000 down to the final top-10. Snippets are generated on the critical path. A query result cache absorbs ~30% of traffic. PageRank runs offline weekly. Budget: 150 ms shard lookup + 80 ms rerank + 20 ms snippet = under 250 ms p99."

07

Tradeoffs & Design Choices

  • Scatter to all shards vs smart routing. Scatter wastes work but is resilient and uniform. Routing only to "relevant" shards saves work but requires shard-term statistics everywhere. Google uses scatter — CPU is cheap relative to the ops complexity of routing.
  • Two-stage ranking. BM25 is cheap but dumb; BERT is smart but 100× slower. Two-stage ranking (cheap over millions → expensive over thousands) is the standard modern answer. Single-stage BERT over all docs would be 1000× too slow.
  • Batch index rebuild vs real-time updates. Full index is rebuilt nightly via MapReduce. Fresh content (news, social) goes into a small "live index" served in parallel and merged at query time. Two-index architecture hits both scale and freshness.
  • Inverted index in RAM vs on disk. Hot posting lists in RAM (top N terms cover most queries); cold on SSD. Disk-backed portions use skip pointers + delta-compressed posting lists to stream fast.
  • Personalization. Pure relevance is simpler; personalization improves user satisfaction but creates filter-bubble complaints and makes ranking harder to reason about. Google personalizes lightly (geo, recent queries, signed-in history).
08

Failure Modes

🐢
Straggler shard blows the p99
Aggregator waits for the slowest of 1000 shards. Even with fast average, tail latency dominates: p99(slowest) across 1000 ≈ very bad.
→ Mitigation: hedged requests — after 95 ms on a shard, send a second request to a replica; use whichever returns first. Also: request cancellation; ignore shards that miss the 150 ms budget and serve the partial result.
🕷️
Crawler stuck on infinite URL loops
Calendar sites with "?date=next" links, search-result pages linking to more search-results → infinite crawl frontier.
→ Mitigation: URL normalization, SimHash of rendered content, per-host budget caps, hop-count limits, ML spam classifier on URL patterns.
🌊
News event doubles QPS in 5 minutes
Earthquake happens; queries spike 3× on "earthquake" + local queries. Aggregators saturate.
→ Mitigation: pre-provisioned headroom, autoscaler for aggregator tier, aggressive result-cache TTL for trending queries (thundering herd prevention), degraded mode that skips the reranker under load.
📅
Stale index after big news
Fresh page published about breaking news; doesn't show up in results for hours because batch index rebuilds on daily cadence.
→ Mitigation: real-time freshness layer — separate inverted index fed from a high-priority crawl bucket, queryable within minutes. Merged with the main index at aggregator time.
🛡️
SEO spam / adversarial content
Sites keyword-stuff to game BM25 / inject invisible backlinks for PageRank. Index quality degrades.
→ Mitigation: spam classifier in crawl pipeline rejects obvious spam; ranking model trained on user engagement naturally demotes low-quality content; manual review for egregious cases ("demote domain").
🔒
Query log privacy incident
Query logs are ranking gold but massively sensitive. Leak = disaster.
→ Mitigation: per-user anonymization after N days, aggregate statistics only at query-signal level, strict internal access controls, differential-privacy noise on long-tail query stats.

Anti-patterns

🚫
Grep-style linear scan over all docs

100B pages × 100 KB = you're done.

✓ Better: Inverted index per term; posting lists; scatter-gather across document-sharded index.
🚫
One giant inverted index on one machine

Doesn't fit; one query's miss kills everyone.

✓ Better: Document-sharded across ~1000 shards; each returns local top-K; aggregator merges.
🚫
Rerank all 100B results with BERT for every query

BERT at 100K QPS over 100B docs is impossible.

✓ Better: Two-stage: cheap BM25 over millions → heavy ML rerank over top-1000.
09

Interview Tips

  1. Name the three pipelines. Crawl, index build, query serving. Many candidates conflate them or only discuss query serving. Show you know crawl is a separate beast.
  2. Document-sharding is the default answer. If you say "term-sharded," be ready to defend the hot-term problem. Google uses document-sharded; most interviewers expect that.
  3. Two-stage ranking. Explicitly call out: cheap first-pass over all candidates → expensive neural reranker over a small set. This is the modern pattern and shows you're current.
  4. Query cache answers "how do you handle 100K QPS?" A big fraction of queries are repeats; a simple LRU slashes load by 30%+. Mention it early.
  5. Fresh index + main index. If asked "how do you handle breaking news?" — two-index architecture is the answer, not "rebuild more often."
11

Evolution

1

MVP — single-machine Lucene

Crawl ~10M docs; build an on-disk inverted index; one HTTP server handles queries. Works for a niche vertical (e.g., "search this intranet"). Not the web.

2

Document-sharded index + aggregator

Shard by doc_id hash; scatter-gather across shards. BM25 on each shard. This is pre-PageRank search (AltaVista era).

3

PageRank-enhanced ranking

Batch-compute PageRank over the link graph; factor authority into ranking. This is the original Google advantage (1998).

4

Learning-to-rank with click signals

User click + dwell data trains a rerank model. Ranker is now data-driven, not just heuristic. Google's RankBrain era (2015).

5

Neural retrieval + LLM answers

Dense vector retrieval for semantic matching; BERT/transformer rerankers. LLM-generated direct answers ("AI Overviews") on top of the ranked list. Modern Google (2023+).

Next up