Exercise · Search & Discovery
Google Search — Web Search Engine
Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.
Out of 10 points45 min whiteboardReference solution →
01
Prompt
Design Google's web search. 100B pages indexed, 100K queries/sec, < 200ms p99 query latency. Crawl + index + rank.
Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.
02
Hints (progressive — click to reveal)
Hint 1
Three independent pipelines: crawl (continuous), index build (batch), query serving (real-time). Don't conflate.
Hint 2
Document-sharded inverted index across ~1000 shards. Scatter-gather at query time.
Hint 3
Two-stage ranking: cheap BM25 over millions → heavy BERT rerank over top-1000.
03
Rubric — 10 points
- +1 BoE: 100B pages × 100KB HTML = 10PB raw; ~100M unique terms; ~400PB stored with derivatives
- +2 Document-sharded inverted index; ~1000 shards; scatter-gather query with aggregator merge
- +2 Two-stage ranking: BM25 first-pass over candidates → BERT/cross-encoder rerank over top 1000
- +1 Query cache absorbs ~30% of traffic (repeated queries); LRU with short TTL
- +1 Crawler: URL frontier + fetchers + parser + dedup (SimHash); polite per-host rate limits
- +1 Separate fresh-index for breaking-news content; merged at query time with main index
- +1 Snippet generation + spell-correction + query expansion on the critical path
- +1 Hedged requests to handle straggler shards: duplicate slow requests after N ms
Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.
04
Red flags (things that tank the interview)
- Term-sharded index without discussing hot-term problem
- One massive inverted index on a single machine
- Real-time index updates (no batch pipeline)
- Rerank all results with BERT (too slow)
- Ignores crawling / indexing; jumps straight to query serving