Exercise · Search & Discovery

Google Search — Web Search Engine

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points45 min whiteboardReference solution →
01

Prompt

Design Google's web search. 100B pages indexed, 100K queries/sec, < 200ms p99 query latency. Crawl + index + rank.

Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.

02

Hints (progressive — click to reveal)

Hint 1

Three independent pipelines: crawl (continuous), index build (batch), query serving (real-time). Don't conflate.

Hint 2

Document-sharded inverted index across ~1000 shards. Scatter-gather at query time.

Hint 3

Two-stage ranking: cheap BM25 over millions → heavy BERT rerank over top-1000.

03

Rubric — 10 points

  • +1 BoE: 100B pages × 100KB HTML = 10PB raw; ~100M unique terms; ~400PB stored with derivatives
  • +2 Document-sharded inverted index; ~1000 shards; scatter-gather query with aggregator merge
  • +2 Two-stage ranking: BM25 first-pass over candidates → BERT/cross-encoder rerank over top 1000
  • +1 Query cache absorbs ~30% of traffic (repeated queries); LRU with short TTL
  • +1 Crawler: URL frontier + fetchers + parser + dedup (SimHash); polite per-host rate limits
  • +1 Separate fresh-index for breaking-news content; merged at query time with main index
  • +1 Snippet generation + spell-correction + query expansion on the critical path
  • +1 Hedged requests to handle straggler shards: duplicate slow requests after N ms

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.

04

Red flags (things that tank the interview)

  • Term-sharded index without discussing hot-term problem
  • One massive inverted index on a single machine
  • Real-time index updates (no batch pipeline)
  • Rerank all results with BERT (too slow)
  • Ignores crawling / indexing; jumps straight to query serving