A system that automatically traverses the web — starting from seed URLs, fetching pages, extracting links, and repeating — to build a searchable index of internet content. Deceptively simple at small scale; one of the hardest distributed systems problems at Google scale.
Given seed URLs, crawl and index reachable web pages
Extract and follow all hyperlinks discovered on each page
Store raw HTML and parsed content for indexing
Detect and skip duplicate content — exact and near-duplicate
Respect robots.txt and per-domain crawl delays
Re-crawl pages periodically to keep index fresh
Non-Functional
Crawl 10 billion pages within 30 days
Prioritise high-authority pages — crawl important content first
Polite: max 1 req/sec per domain unless specified otherwise
Resilient — worker failures must not lose URLs or duplicate work
Horizontally scalable — add machines to increase throughput
Extensible — support multiple downstream consumers (search, ML)
Scope Decision
We are building the crawler pipeline only — not the search index, ranking algorithm, or query serving layer. Our output is a content store of fetched pages with metadata. Indexing and ranking are downstream consumers of our output.
03
Scale Estimation
Metric
Calculation
Result
Target pages
Industry benchmark — Google's index
10 billion
Crawl window
Requirement
30 days
Required crawl rate
10B ÷ (30 × 86,400s)
~3,900 pages/sec
Average page size
Typical HTML page incl. inline resources
~100 KB
Inbound bandwidth
3,900 × 100 KB
~390 MB/s = 3 Gbps
Raw HTML storage (5yr)
10B × 100 KB
~1 PB
URL frontier size
10B URLs × 100 bytes metadata
~1 TB
Bloom filter memory
10B × 10 bits (1% false positive rate)
~12 GB RAM
Key Insight
A single machine fetching 2 pages/sec (500ms network round-trip) would take 158 years to crawl 10B pages. The fundamental challenge is not algorithmic complexity — it is I/O-bound parallelism at massive scale. Every architectural decision flows from this constraint.
04
API Design
A crawler is primarily an internal system — its "API" is the interface between its own components, plus a control plane for operators. External consumers (search indexers, ML pipelines) read from the content store directly.
POST/v1/crawl/seedsSubmit seed URLs to start a crawl
POST/v1/domains/:domain/blockOperator override — block a domain
Returns 204 No Content. Immediately stops all pending fetches for this domain. Used for legal compliance, abuse prevention, or site owner requests.
05
High-Level Architecture
Architecture — Full Crawler PipelineSVG Diagram
URL Frontier Service
The brain of the system. Maintains the priority queue of URLs to crawl, runs Bloom filter dedup, tracks per-domain crawl budgets, and schedules recrawls. Workers pull batches of 100 URLs — not one at a time — to reduce coordination overhead.
Fetcher Fleet
I/O-bound workers that make HTTP requests. Each machine maintains thousands of async connections simultaneously. Uses domain affinity — each domain is owned by one machine — eliminating cross-machine coordination for politeness.
Kafka (raw_pages)
Decouples fetching from parsing. Fetchers produce pages at variable rates; parsers consume at their own pace. Provides durability — a crashed parser doesn't lose work, the message is redelivered. Handles backpressure automatically.
Parser Fleet
CPU-bound workers. Strip boilerplate, extract links, compute SimHash for near-duplicate detection, extract metadata. Scaled independently from fetchers — different hardware profile, different bottleneck (CPU vs network).
06
Deep Dive — The Deduplication Stack
Why This Is The Core Problem
The web is ~60–70% duplicate content by some estimates — syndicated news, scraped pages, boilerplate-wrapped articles, URL parameter variants. Without a rigorous dedup stack, most of your crawl budget is wasted on content you've already seen. Every layer catches a different class of duplicate.
Every URL passes through three deduplication layers before it enters the index. Each layer is faster and cheaper than the next, so most duplicates are caught early:
L1URL NormalizationStrip tracking params, session IDs, fragments. ?utm_source=twitter → removed. Canonical form before any lookup.~0ms · no storage
L2Bloom FilterCheck if normalized URL has been seen. 12GB RAM for 10B URLs. 1% false positive rate — occasionally skips a new URL, never re-crawls a seen one.<1ms · 12 GB RAM
L3Content Hash (MD5)After fetching, hash the raw HTML. Exact byte-for-byte duplicate detection. Catches mirrors and identical copies at different URLs.<1ms · 16B per page
L4SimHash (Near-Duplicate)Compute 64-bit SimHash of page content. Find all stored hashes within Hamming distance ≤ 3. Catches syndicated articles, scraped content, boilerplate variations.~2ms · 8B per page
Add a URL to mark it as seen, then check it again to see the filter in action.
Sequence — URL Discovery to IndexMermaid.js
sequenceDiagram
participant P as Parser
participant BF as Bloom Filter
participant F as Frontier
participant FT as Fetcher
participant SH as SimHash Store
participant DB as Content Store
P->>P: Extract link from page
P->>P: Normalize URL (strip tracking params)
P->>BF: Has this URL been seen?
alt Already seen
BF-->>P: YES — skip (or 1% false positive)
else New URL
BF-->>P: NO — proceed
P->>F: Add URL with priority score
F->>FT: Dispatch URL batch (100 at a time)
FT->>FT: Check robots.txt cache
FT->>FT: Respect crawl-delay
FT->>FT: HTTP GET with ETag/If-Modified-Since
alt 304 Not Modified
FT-->>F: Content unchanged — skip re-index
else 200 OK
FT->>BF: Mark URL as seen
FT->>SH: Check SimHash (near-dup?)
alt Near-duplicate found
SH-->>FT: Hamming distance ≤ 3 — skip
else Unique content
FT->>DB: Store raw HTML + metadata
FT->>P: Emit to Kafka for parsing
end
end
end
The SimHash lookup at scale deserves special attention. With 10 billion stored hashes, naively comparing a new hash against all of them would be impossibly slow. The trick exploits a mathematical property: if two 64-bit hashes differ in at most 3 bits, and you split the hash into 4 chunks of 16 bits, at least one chunk must be identical (by the pigeonhole principle — 3 flips across 4 buckets leaves one untouched).
So you maintain 4 hash tables, each indexed by a different 16-bit chunk. A near-duplicate lookup becomes 4 hash table lookups instead of 10 billion comparisons. This is the same paper Google published in 2007 describing how they detect near-duplicates at web scale.
Request Flow — Step Through
URL Frontier · Priority Queue→Fetcher · Async HTTP→Kafka · raw_pages→Parser · CPU worker→Content Store · S3 + Cassandra
Click Next Step to walk through the request flow.
07
Key Design Decisions & Tradeoffs
Chosen Approach
Domain Affinity (Static Partitioning)
Assign each domain to one fetcher machine via hash(domain) % N. Politeness tracking is entirely local — no locks, no cross-machine coordination. Simple, predictable, fast.
✓ No coordination overhead
Alternative
Global Distributed Lock per Domain
Any worker can crawl any domain, but must acquire a distributed lock (Redis SETNX) before hitting a domain. Flexible — no hot-spot machines. But lock contention at 10,000 domains × 1,000 workers becomes a serious bottleneck.
✗ Lock contention at scale
Chosen Approach
Bloom Filter for URL Dedup
12 GB RAM for 10B URLs. 1% false positive rate means occasionally skipping a genuinely new URL. Zero false negatives — never re-crawls a seen URL. The asymmetry is acceptable: re-crawling is expensive, missing 1% of new URLs is not.
✓ 85% memory reduction vs hash set
Alternative
Full URL Hash Set
Store a 64-bit hash of every URL. Zero false positives — perfect accuracy. But costs ~80 GB RAM for 10B URLs. No false positives sounds better, but in practice missing 1% of URLs is completely acceptable for a crawler.
~ Acceptable at smaller scale
Chosen Approach
Kafka Between Fetcher and Parser
Async decoupling via a message queue. Fetchers never block on parsers. Crashed parsers don't lose work. Backpressure handled automatically. Fetcher and parser fleets scale independently.
✓ Resilient, independently scalable
Alternative
Fetcher Calls Parser Directly (RPC)
Synchronous call — simpler architecture, lower latency. But fetcher throughput is now capped by parser throughput. A slow or crashed parser brings fetchers to a halt. No durability — in-flight work is lost on failure.
✗ Tight coupling, fragile
Chosen Approach
Adaptive Recrawl Scheduling
Re-crawl frequency adapts to observed change rate. nytimes.com/home changes daily → crawl daily. docs.stripe.com/api unchanged for months → crawl monthly. Exponential backoff on unchanged pages.
✓ Efficient resource allocation
Alternative
Fixed Recrawl Schedule
Crawl every page every N days regardless of change frequency. Simple to implement. But wastes enormous bandwidth recrawling static pages while potentially missing frequent updates on dynamic ones.
✗ Wastes crawl budget
08
What Can Go Wrong
🕸️
Spider Trap
A site generates infinite unique URLs — product catalogs, calendar pages, session-parameterised URLs. The frontier fills with millions of URLs from a single domain. Bloom filter is useless here — every URL genuinely is new. Crawl budget gets consumed by one site.
The URL Frontier is a central service — if it dies, all workers have no work to pull. Fetchers sitting idle. Worse, if the in-memory priority queue isn't persisted, you lose your crawl state entirely on restart.
→ Fix: Persist frontier to disk continuously + workers hold local batches (keep processing during brief outage)
🔄
Duplicate Crawl Under Failure
Fetcher pulls a batch of 100 URLs, crawls 60 of them, then crashes. On restart, those 60 URLs are re-dispatched from the frontier. You crawl them again — wasting bandwidth and potentially adding duplicate content to the store.
→ Fix: Frontier uses ack-based delivery — URL not removed until fetcher confirms success. Idempotent writes to content store (upsert by URL hash)
🚫
IP Ban / Rate Limit Violation
Domain affinity works per machine but a domain migration (new machine assigned) or a misconfigured crawl delay causes burst traffic to a domain. Site bans your IP range. Entire IP block becomes useless for that domain.
→ Fix: Multiple egress IP pools per region + exponential backoff on 429/503 responses + robots.txt re-fetch on ban detection
📈
Bloom Filter Saturation
A Bloom filter's false positive rate rises as it fills up. At 50% fill, a 3-hash filter approaches 10% false positive rate — meaning 1 in 10 genuinely new URLs gets skipped. The filter becomes too aggressive at high fill factors.
→ Fix: Monitor fill factor — rebuild/rotate filter when exceeding 40% capacity. Use counting Bloom filter variant for deletions.
09
Interview Tips
01
Start with the core loop, then identify what breaks. Most candidates jump straight to distributed architecture. Instead, describe the simple single-machine loop first, then say "here's why each part breaks at scale" — this shows structured thinking rather than pattern-matching.
02
Mention politeness early and unprompted. This is the trap most candidates miss. Interviewers specifically watch for whether you bring up robots.txt and crawl delays without being asked. It signals you understand the web is a shared resource, not just a target to scrape.
03
The Bloom filter question is almost guaranteed. Know it cold: what it does, why it's used instead of a hash set, what a false positive means in this context, and why false positives are acceptable but false negatives are not. The asymmetry is the key insight.
04
Distinguish fetcher scale from parser scale. These are I/O-bound and CPU-bound respectively. Scaling them separately is a real engineering decision — interviewers love when candidates identify that two components have different bottlenecks and should scale independently with Kafka decoupling them.
05
Common follow-up: "How do you handle JavaScript-heavy pages?" Modern SPAs don't render content in raw HTML — you'd need a headless browser (Puppeteer, Playwright) in your fetcher fleet, which is 10–100× slower than a plain HTTP fetch. Answer: tiered crawling — fast HTTP fetch first, headless browser only for pages that need it (detected by minimal HTML content).
06
Name the spider trap problem before being asked. Say: "One failure mode I want to address proactively is spider traps — infinite URL generation by a single domain." Then explain domain budgets and depth limits. Shows you understand adversarial inputs, not just happy-path architecture.
10
Similar Problems
The web crawler combines patterns from several other system design problems. Mastering it gives you strong intuition for all of these.
One Python script. Sequential fetches. In-memory set for dedup. Fine for crawling a single site or a few thousand pages. Hits a wall at ~172,000 pages/day. This is the right starting point — understand what breaks before reaching for distributed complexity.
Phase 2 — Single Machine, Async
Async I/O — 100× Throughput
Switch from sequential to async (Python asyncio, Node.js). Thousands of concurrent connections on one machine. Move dedup set to Redis so it survives restarts. Add robots.txt caching. Get to ~200,000 pages/hour. Most small-scale crawlers live here.
Phase 3 — Multi-Machine
Distributed Fetcher Fleet + Central Frontier
Introduce the URL Frontier as a central service. Multiple fetcher machines with domain affinity. Bloom filter replaces Redis set for memory efficiency. Separate fetching and parsing with Kafka. DNS cache service. Now reaching millions of pages/hour.
Phase 4 — Web Scale
SimHash, PageRank Feedback, Multi-Region
Add near-duplicate detection via SimHash. Feed link graph back to PageRank to improve frontier prioritization. Multi-region deployment for latency and redundancy. Tiered crawling — fast HTTP path + slower headless browser path for JS-heavy pages. This is Google / Bing territory — tens of thousands of machines, petabytes of storage.
Phase 5 — Specialization
Domain-Specific Crawlers
Separate crawl pipelines for different content types: news (freshness over coverage), e-commerce (structured data extraction), academic papers (PDF parsing), social media (API access over crawl). Each has a different freshness requirement, depth strategy, and parsing logic. The general-purpose crawler architecture becomes a platform others build on.