Search Engine

A web-scale search engine — crawl the public web, build an inverted index, serve queries in under 200 ms with high-quality ranked results. Google handles ~100K queries per second over an index of hundreds of billions of pages, ~400 PB of data. The hard parts: an inverted index that fits across thousands of machines, a query serving pipeline that aggregates shard responses in milliseconds, and a ranking system that blends lexical match, link authority, and learned relevance.

⚡ Core: Inverted Index + Ranking~100K QPS< 200 ms p99100B+ pages indexedCrawl → Index → Serve pipeline

Requirements

Functional

Crawl the public web continuously; discover new pages, refresh existing
Build an inverted index: term → docs containing that term
Serve keyword queries, phrase queries, filters (site:, date, language)
Rank results by relevance — lexical score + authority + freshness + personalization
Return top-10 results with snippets + cached links in < 200 ms
Autocomplete / spell-correct / "did you mean"

Non-Functional

< 200 ms p99 query latency end-to-end
~100K QPS sustained; burst 2× during news events
Index covers 100B+ pages; recrawl important pages every few minutes
99.99% query availability — users notice outages in seconds
Eventual consistency: freshly-crawled pages appear in index within minutes-to-hours
Geo-distributed: queries served from nearest region

Scale Estimation

Pages indexed

~100B

estimated Google index; raw HTML ~100 KB avg = 10 PB raw → ~400 PB stored with all derivatives

Queries / sec

~100K

Google handles ~99K/sec global avg; ~9B/day

Unique terms

~100M

after normalization; inverted index has 100M posting lists

Posting list size

~100 TB

compressed; needs ~1000-way shard to fit in RAM-tier across cluster

Crawl rate

~100K pages/sec

~10B pages/day; polite crawl respects robots.txt per-host

Query fan-out

~1000 shards

each query scatter-gathers across every index shard; aggregator merges top-K

API Design

GET/search?q=TERM&start=0&count=10&lang=en&site=github.com

Main search endpoint. Returns { results: [{url, title, snippet, rank, cached_link}], total_hits, spell_suggestion }. Latency budget: 150 ms shard query + 30 ms aggregation + 20 ms snippet rendering.

GET/suggest?q=PREFIX

Autocomplete. Served from a separate trie-based suggestion service for < 30 ms p99. Completions come from query logs, not the main index.

GET/images?q=TERM

Image search. Separate vertical — different index (CLIP embeddings + text), different ranking, served by a parallel pipeline.

POST/crawler/submit-url internal

Sitemap / webmaster API for site owners to hint "crawl this URL." Goes into the crawl frontier priority queue.

POST/click-feedback internal

Logs user clicks + dwell time per (query, url) pair. This is the ranking model's training signal. ~9B clicks/day is a massive ML dataset.

Architecture

Three independent pipelines: crawl (continuous), index build (periodic batch), query serving (real-time). They communicate via large shared data stores — the crawled page corpus and the inverted index.

Crawl → Index → Serve pipeline SVG

Request Flow — Step Through

Query · user types→Parser · normalize + expand→Cache · query-result LRU→Aggregator · scatter to shards→Index shards · BM25 local top-100→Reranker · BERT top-1000 → 10→Snippets · render + return

Click Next Step to walk through the request flow.

Deep Dive — Inverted Index + Query Serving

An inverted index maps each term to the list of documents containing it — the posting list. Query "climate change" becomes: fetch posting list for "climate", fetch for "change", intersect, score, return top-K. The whole design is shaped around making this fast at 100K QPS over 100B pages.

Sharding choice. Two strategies:

Document-sharded (Google's choice): each shard holds a subset of docs + the posting lists for those docs. A query scatter-gathers to every shard; each returns local top-K; aggregator merges to global top-K. Scales well to many QPS; wastes some work because every query hits every shard.
Term-sharded: each shard owns a subset of terms' full posting lists. Query routes only to the shards holding the queried terms. Very efficient per query — but hot terms ("the", "COVID") create pathological load; rebalancing is painful. Used for specialty verticals only.

Query pipeline, in order of execution:

Parse + normalize. Lowercase, stemming, stop-word handling, query expansion (synonyms, abbreviations). Spell-correction against a dictionary + query-log bigram model.
Scatter to all index shards. Aggregator sends query to ~1000 shards in parallel. Each shard returns its local top-100 with scores.
Shard-local scoring (first-pass ranker). Classic BM25 or similar. Cheap; runs over millions of candidate docs per shard.
Aggregator merges to top-1000. Heap-merge, take global top-1000 by BM25.
Reranker (second-pass). A heavy ML model (e.g., BERT-based cross-encoder) scores the top-1000 with query+doc pairs. 100× more expensive per doc but only on a small set.
Snippet generation. Extract sentence(s) best matching the query from each of the top-10 docs. This is lightweight but user-visible; lives on the critical path.

Query latency budget breakdown Mermaid

sequenceDiagram participant U as User participant AG as Aggregator participant S1 as Shard 1..N participant RR as Reranker participant RS as Response U->>AG: query "climate change 2026" Note over AG: parse + expand
~5 ms AG->>S1: scatter to ~1000 shards S1-->>AG: shard top-100 + BM25
~50 ms (parallel) Note over AG: merge to global top-1000
~10 ms AG->>RR: top-1000 with query RR-->>AG: top-10 ML-reranked
~80 ms (BERT batched) AG->>RS: snippet gen + render
~20 ms RS-->>U: results
total ~165 ms p50

Caching. A massive query cache sits in front of the whole pipeline. The long tail is huge, but the head is tiny — a surprising fraction of queries are repeats within minutes. A simple LRU on normalized query strings serves ~30–40% of traffic without touching the index at all. Result cache TTL is short (minutes) to preserve freshness.

Ranking signals. The modern ranker combines:

Lexical (BM25). Term frequency × inverse document frequency. Still the baseline.
Link authority (PageRank variants). Computed offline weekly over the web graph.
Freshness. News-y queries favor recent docs; evergreen queries don't.
User signals. Click-through rate, dwell time, bounce rate per (query, doc) pair — trained into the reranker.
Semantic relevance (neural). Dense vector similarity between query and doc embeddings. Retrieves semantically-matching docs that don't share exact terms. Google's "neural matching" + BERT for reranking.

Interview answer

"Document-sharded inverted index across ~1000 shards. Query aggregator scatters to all shards in parallel, each returns local top-100 using BM25, aggregator merges to top-1000, then a BERT-based cross-encoder reranks top-1000 down to the final top-10. Snippets are generated on the critical path. A query result cache absorbs ~30% of traffic. PageRank runs offline weekly. Budget: 150 ms shard lookup + 80 ms rerank + 20 ms snippet = under 250 ms p99."

Tradeoffs & Design Choices

Scatter to all shards vs smart routing. Scatter wastes work but is resilient and uniform. Routing only to "relevant" shards saves work but requires shard-term statistics everywhere. Google uses scatter — CPU is cheap relative to the ops complexity of routing.
Two-stage ranking. BM25 is cheap but dumb; BERT is smart but 100× slower. Two-stage ranking (cheap over millions → expensive over thousands) is the standard modern answer. Single-stage BERT over all docs would be 1000× too slow.
Batch index rebuild vs real-time updates. Full index is rebuilt nightly via MapReduce. Fresh content (news, social) goes into a small "live index" served in parallel and merged at query time. Two-index architecture hits both scale and freshness.
Inverted index in RAM vs on disk. Hot posting lists in RAM (top N terms cover most queries); cold on SSD. Disk-backed portions use skip pointers + delta-compressed posting lists to stream fast.
Personalization. Pure relevance is simpler; personalization improves user satisfaction but creates filter-bubble complaints and makes ranking harder to reason about. Google personalizes lightly (geo, recent queries, signed-in history).

Failure Modes

🐢

Straggler shard blows the p99

Aggregator waits for the slowest of 1000 shards. Even with fast average, tail latency dominates: p99(slowest) across 1000 ≈ very bad.

→ Mitigation: hedged requests — after 95 ms on a shard, send a second request to a replica; use whichever returns first. Also: request cancellation; ignore shards that miss the 150 ms budget and serve the partial result.

🕷️

Crawler stuck on infinite URL loops

Calendar sites with "?date=next" links, search-result pages linking to more search-results → infinite crawl frontier.

→ Mitigation: URL normalization, SimHash of rendered content, per-host budget caps, hop-count limits, ML spam classifier on URL patterns.

🌊

News event doubles QPS in 5 minutes

Earthquake happens; queries spike 3× on "earthquake" + local queries. Aggregators saturate.

→ Mitigation: pre-provisioned headroom, autoscaler for aggregator tier, aggressive result-cache TTL for trending queries (thundering herd prevention), degraded mode that skips the reranker under load.

📅

Stale index after big news

Fresh page published about breaking news; doesn't show up in results for hours because batch index rebuilds on daily cadence.

→ Mitigation: real-time freshness layer — separate inverted index fed from a high-priority crawl bucket, queryable within minutes. Merged with the main index at aggregator time.

🛡️

SEO spam / adversarial content

Sites keyword-stuff to game BM25 / inject invisible backlinks for PageRank. Index quality degrades.

→ Mitigation: spam classifier in crawl pipeline rejects obvious spam; ranking model trained on user engagement naturally demotes low-quality content; manual review for egregious cases ("demote domain").

🔒

Query log privacy incident

Query logs are ranking gold but massively sensitive. Leak = disaster.

→ Mitigation: per-user anonymization after N days, aggregate statistics only at query-signal level, strict internal access controls, differential-privacy noise on long-tail query stats.

⚠

Anti-patterns

🚫

Grep-style linear scan over all docs

100B pages × 100 KB = you're done.

✓ Better: Inverted index per term; posting lists; scatter-gather across document-sharded index.

🚫

One giant inverted index on one machine

Doesn't fit; one query's miss kills everyone.

✓ Better: Document-sharded across ~1000 shards; each returns local top-K; aggregator merges.

🚫

Rerank all 100B results with BERT for every query

BERT at 100K QPS over 100B docs is impossible.

✓ Better: Two-stage: cheap BM25 over millions → heavy ML rerank over top-1000.

Interview Tips

Name the three pipelines. Crawl, index build, query serving. Many candidates conflate them or only discuss query serving. Show you know crawl is a separate beast.
Document-sharding is the default answer. If you say "term-sharded," be ready to defend the hot-term problem. Google uses document-sharded; most interviewers expect that.
Two-stage ranking. Explicitly call out: cheap first-pass over all candidates → expensive neural reranker over a small set. This is the modern pattern and shows you're current.
Query cache answers "how do you handle 100K QPS?" A big fraction of queries are repeats; a simple LRU slashes load by 30%+. Mention it early.
Fresh index + main index. If asked "how do you handle breaking news?" — two-index architecture is the answer, not "rebuild more often."

Evolution

MVP — single-machine Lucene

Crawl ~10M docs; build an on-disk inverted index; one HTTP server handles queries. Works for a niche vertical (e.g., "search this intranet"). Not the web.

Document-sharded index + aggregator

Shard by doc_id hash; scatter-gather across shards. BM25 on each shard. This is pre-PageRank search (AltaVista era).

PageRank-enhanced ranking

Batch-compute PageRank over the link graph; factor authority into ranking. This is the original Google advantage (1998).

Learning-to-rank with click signals

User click + dwell data trains a rerank model. Ranker is now data-driven, not just heuristic. Google's RankBrain era (2015).

Neural retrieval + LLM answers

Dense vector retrieval for semantic matching; BERT/transformer rerankers. LLM-generated direct answers ("AI Overviews") on top of the ranked list. Modern Google (2023+).

📺

References & Videos

How Google Search Works

ByteByteGo · 10 min

Design Google Search

Arpit Bhayani · 35 min

System Design: Search Engine

Hussein Nasser · 22 min

Design a Web Search Engine

Next up

Web Crawler

The crawl pipeline in isolation

Read →

PROBLEM

Typeahead Suggestions

Autocomplete / "did you mean" — query-log-driven

Read →

Search Engine

Requirements

Scale Estimation

API Design

Architecture

Deep Dive — Inverted Index + Query Serving

Tradeoffs & Design Choices

Failure Modes

Anti-patterns

Interview Tips

Similar Problems

Web Crawler

Typeahead Suggestions

Recommendation Algorithm

Leaderboard

Yelp / Google Places

Evolution

MVP — single-machine Lucene

Document-sharded index + aggregator

PageRank-enhanced ranking

Learning-to-rank with click signals

Neural retrieval + LLM answers

References & Videos

Web Crawler

Typeahead Suggestions