Exercise · Search & Discovery

Google Search — Web Search Engine

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points45 min whiteboardReference solution →

Prompt

Design Google's web search. 100B pages indexed, 100K queries/sec, < 200ms p99 query latency. Crawl + index + rank.

Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.

Hint 1

Three independent pipelines: crawl (continuous), index build (batch), query serving (real-time). Don't conflate.

Hint 2

Document-sharded inverted index across ~1000 shards. Scatter-gather at query time.

Hint 3

Two-stage ranking: cheap BM25 over millions → heavy BERT rerank over top-1000.

+1 BoE: 100B pages × 100KB HTML = 10PB raw; ~100M unique terms; ~400PB stored with derivatives
+2 Document-sharded inverted index; ~1000 shards; scatter-gather query with aggregator merge
+2 Two-stage ranking: BM25 first-pass over candidates → BERT/cross-encoder rerank over top 1000
+1 Query cache absorbs ~30% of traffic (repeated queries); LRU with short TTL
+1 Crawler: URL frontier + fetchers + parser + dedup (SimHash); polite per-host rate limits
+1 Separate fresh-index for breaking-news content; merged at query time with main index
+1 Snippet generation + spell-correction + query expansion on the critical path
+1 Hedged requests to handle straggler shards: duplicate slow requests after N ms

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.