Web Crawler

A system that automatically traverses the web — starting from seed URLs, fetching pages, extracting links, and repeating — to build a searchable index of internet content. Deceptively simple at small scale; one of the hardest distributed systems problems at Google scale.

⚡ Core: URL Frontier + Deduplication10B pagesWrite-heavyDistributedPoliteness-aware
02

Requirements

Functional
  • Given seed URLs, crawl and index reachable web pages
  • Extract and follow all hyperlinks discovered on each page
  • Store raw HTML and parsed content for indexing
  • Detect and skip duplicate content — exact and near-duplicate
  • Respect robots.txt and per-domain crawl delays
  • Re-crawl pages periodically to keep index fresh
Non-Functional
  • Crawl 10 billion pages within 30 days
  • Prioritise high-authority pages — crawl important content first
  • Polite: max 1 req/sec per domain unless specified otherwise
  • Resilient — worker failures must not lose URLs or duplicate work
  • Horizontally scalable — add machines to increase throughput
  • Extensible — support multiple downstream consumers (search, ML)
Scope Decision

We are building the crawler pipeline only — not the search index, ranking algorithm, or query serving layer. Our output is a content store of fetched pages with metadata. Indexing and ranking are downstream consumers of our output.

03

Scale Estimation

MetricCalculationResult
Target pages Industry benchmark — Google's index 10 billion
Crawl window Requirement 30 days
Required crawl rate 10B ÷ (30 × 86,400s) ~3,900 pages/sec
Average page size Typical HTML page incl. inline resources ~100 KB
Inbound bandwidth 3,900 × 100 KB ~390 MB/s = 3 Gbps
Raw HTML storage (5yr) 10B × 100 KB ~1 PB
URL frontier size 10B URLs × 100 bytes metadata ~1 TB
Bloom filter memory 10B × 10 bits (1% false positive rate) ~12 GB RAM
Key Insight

A single machine fetching 2 pages/sec (500ms network round-trip) would take 158 years to crawl 10B pages. The fundamental challenge is not algorithmic complexity — it is I/O-bound parallelism at massive scale. Every architectural decision flows from this constraint.

04

API Design

A crawler is primarily an internal system — its "API" is the interface between its own components, plus a control plane for operators. External consumers (search indexers, ML pipelines) read from the content store directly.

POST /v1/crawl/seeds Submit seed URLs to start a crawl
// Request — add seed URLs with optional priority override { "urls": ["https://wikipedia.org", "https://github.com"], "priority": "high", "crawl_depth": 8 } // Response 202 Accepted { "job_id": "crawl_a1b2c3", "seeds_queued": 2, "status": "accepted" }
GET /v1/crawl/status/:job_id Check crawl job progress
// Response 200 { "job_id": "crawl_a1b2c3", "pages_crawled": 4821903, "pages_queued": 12043221, "errors": 1203, "crawl_rate_rps": 3847 }
GET /v1/pages/:url_hash Retrieve crawled page content
// Response 200 — for downstream consumers (indexers, ML pipelines) { "url": "https://en.wikipedia.org/wiki/Alan_Turing", "crawled_at": "2025-01-01T12:00:00Z", "content_hash": "a3f8c2d1...", "simhash": "1b3f8a2c...", "raw_html_s3": "s3://crawl-store/2025/01/a3f8c2d1.html.gz", "outbound_links": 243, "language": "en", "http_status": 200 }
POST /v1/domains/:domain/block Operator override — block a domain

Returns 204 No Content. Immediately stops all pending fetches for this domain. Used for legal compliance, abuse prevention, or site owner requests.

05

High-Level Architecture

Architecture — Full Crawler Pipeline SVG Diagram
INPUT Seed URLs sitemap.xml BRAIN URL Frontier Priority Queue Bloom Filter Domain Budgets Recrawl Sched. batches of 100 INFRA DNS Cache >99% hit rate I/O BOUND Fetcher Fleet robots.txt cache domain affinity 5K async conn/node politeness delays N × machines BUFFER Kafka raw_pages topic durable backpressure CPU BOUND Parser Fleet strip boilerplate extract links compute SimHash detect near-dups M × machines STORAGE S3 / GCS raw HTML · 1PB STORAGE Cassandra parsed content GRAPH Link Graph PageRank input extracted links loop back to frontier Primary path Link feedback loop Async / optional
URL Frontier Service

The brain of the system. Maintains the priority queue of URLs to crawl, runs Bloom filter dedup, tracks per-domain crawl budgets, and schedules recrawls. Workers pull batches of 100 URLs — not one at a time — to reduce coordination overhead.

Fetcher Fleet

I/O-bound workers that make HTTP requests. Each machine maintains thousands of async connections simultaneously. Uses domain affinity — each domain is owned by one machine — eliminating cross-machine coordination for politeness.

Kafka (raw_pages)

Decouples fetching from parsing. Fetchers produce pages at variable rates; parsers consume at their own pace. Provides durability — a crashed parser doesn't lose work, the message is redelivered. Handles backpressure automatically.

Parser Fleet

CPU-bound workers. Strip boilerplate, extract links, compute SimHash for near-duplicate detection, extract metadata. Scaled independently from fetchers — different hardware profile, different bottleneck (CPU vs network).

06

Deep Dive — The Deduplication Stack

Why This Is The Core Problem

The web is ~60–70% duplicate content by some estimates — syndicated news, scraped pages, boilerplate-wrapped articles, URL parameter variants. Without a rigorous dedup stack, most of your crawl budget is wasted on content you've already seen. Every layer catches a different class of duplicate.

Every URL passes through three deduplication layers before it enters the index. Each layer is faster and cheaper than the next, so most duplicates are caught early:

L1 URL Normalization Strip tracking params, session IDs, fragments. ?utm_source=twitter → removed. Canonical form before any lookup. ~0ms · no storage
L2 Bloom Filter Check if normalized URL has been seen. 12GB RAM for 10B URLs. 1% false positive rate — occasionally skips a new URL, never re-crawls a seen one. <1ms · 12 GB RAM
L3 Content Hash (MD5) After fetching, hash the raw HTML. Exact byte-for-byte duplicate detection. Catches mirrors and identical copies at different URLs. <1ms · 16B per page
L4 SimHash (Near-Duplicate) Compute 64-bit SimHash of page content. Find all stored hashes within Hamming distance ≤ 3. Catches syndicated articles, scraped content, boilerplate variations. ~2ms · 8B per page
Bloom Filter — Interactive Demo (16-bit, 3 hash functions)
Add a URL to mark it as seen, then check it again to see the filter in action.
Sequence — URL Discovery to Index Mermaid.js
sequenceDiagram participant P as Parser participant BF as Bloom Filter participant F as Frontier participant FT as Fetcher participant SH as SimHash Store participant DB as Content Store P->>P: Extract link from page P->>P: Normalize URL (strip tracking params) P->>BF: Has this URL been seen? alt Already seen BF-->>P: YES — skip (or 1% false positive) else New URL BF-->>P: NO — proceed P->>F: Add URL with priority score F->>FT: Dispatch URL batch (100 at a time) FT->>FT: Check robots.txt cache FT->>FT: Respect crawl-delay FT->>FT: HTTP GET with ETag/If-Modified-Since alt 304 Not Modified FT-->>F: Content unchanged — skip re-index else 200 OK FT->>BF: Mark URL as seen FT->>SH: Check SimHash (near-dup?) alt Near-duplicate found SH-->>FT: Hamming distance ≤ 3 — skip else Unique content FT->>DB: Store raw HTML + metadata FT->>P: Emit to Kafka for parsing end end end

The SimHash lookup at scale deserves special attention. With 10 billion stored hashes, naively comparing a new hash against all of them would be impossibly slow. The trick exploits a mathematical property: if two 64-bit hashes differ in at most 3 bits, and you split the hash into 4 chunks of 16 bits, at least one chunk must be identical (by the pigeonhole principle — 3 flips across 4 buckets leaves one untouched).

So you maintain 4 hash tables, each indexed by a different 16-bit chunk. A near-duplicate lookup becomes 4 hash table lookups instead of 10 billion comparisons. This is the same paper Google published in 2007 describing how they detect near-duplicates at web scale.

Request Flow — Step Through
URL Frontier · Priority QueueFetcher · Async HTTPKafka · raw_pagesParser · CPU workerContent Store · S3 + Cassandra
Click Next Step to walk through the request flow.
07

Key Design Decisions & Tradeoffs

Chosen Approach
Domain Affinity (Static Partitioning)

Assign each domain to one fetcher machine via hash(domain) % N. Politeness tracking is entirely local — no locks, no cross-machine coordination. Simple, predictable, fast.

✓ No coordination overhead
Alternative
Global Distributed Lock per Domain

Any worker can crawl any domain, but must acquire a distributed lock (Redis SETNX) before hitting a domain. Flexible — no hot-spot machines. But lock contention at 10,000 domains × 1,000 workers becomes a serious bottleneck.

✗ Lock contention at scale
Chosen Approach
Bloom Filter for URL Dedup

12 GB RAM for 10B URLs. 1% false positive rate means occasionally skipping a genuinely new URL. Zero false negatives — never re-crawls a seen URL. The asymmetry is acceptable: re-crawling is expensive, missing 1% of new URLs is not.

✓ 85% memory reduction vs hash set
Alternative
Full URL Hash Set

Store a 64-bit hash of every URL. Zero false positives — perfect accuracy. But costs ~80 GB RAM for 10B URLs. No false positives sounds better, but in practice missing 1% of URLs is completely acceptable for a crawler.

~ Acceptable at smaller scale
Chosen Approach
Kafka Between Fetcher and Parser

Async decoupling via a message queue. Fetchers never block on parsers. Crashed parsers don't lose work. Backpressure handled automatically. Fetcher and parser fleets scale independently.

✓ Resilient, independently scalable
Alternative
Fetcher Calls Parser Directly (RPC)

Synchronous call — simpler architecture, lower latency. But fetcher throughput is now capped by parser throughput. A slow or crashed parser brings fetchers to a halt. No durability — in-flight work is lost on failure.

✗ Tight coupling, fragile
Chosen Approach
Adaptive Recrawl Scheduling

Re-crawl frequency adapts to observed change rate. nytimes.com/home changes daily → crawl daily. docs.stripe.com/api unchanged for months → crawl monthly. Exponential backoff on unchanged pages.

✓ Efficient resource allocation
Alternative
Fixed Recrawl Schedule

Crawl every page every N days regardless of change frequency. Simple to implement. But wastes enormous bandwidth recrawling static pages while potentially missing frequent updates on dynamic ones.

✗ Wastes crawl budget
08

What Can Go Wrong

🕸️
Spider Trap

A site generates infinite unique URLs — product catalogs, calendar pages, session-parameterised URLs. The frontier fills with millions of URLs from a single domain. Bloom filter is useless here — every URL genuinely is new. Crawl budget gets consumed by one site.

→ Fix: Per-domain crawl budget + URL pattern detection + max crawl depth = 16
💀
Frontier Service Goes Down

The URL Frontier is a central service — if it dies, all workers have no work to pull. Fetchers sitting idle. Worse, if the in-memory priority queue isn't persisted, you lose your crawl state entirely on restart.

→ Fix: Persist frontier to disk continuously + workers hold local batches (keep processing during brief outage)
🔄
Duplicate Crawl Under Failure

Fetcher pulls a batch of 100 URLs, crawls 60 of them, then crashes. On restart, those 60 URLs are re-dispatched from the frontier. You crawl them again — wasting bandwidth and potentially adding duplicate content to the store.

→ Fix: Frontier uses ack-based delivery — URL not removed until fetcher confirms success. Idempotent writes to content store (upsert by URL hash)
🚫
IP Ban / Rate Limit Violation

Domain affinity works per machine but a domain migration (new machine assigned) or a misconfigured crawl delay causes burst traffic to a domain. Site bans your IP range. Entire IP block becomes useless for that domain.

→ Fix: Multiple egress IP pools per region + exponential backoff on 429/503 responses + robots.txt re-fetch on ban detection
📈
Bloom Filter Saturation

A Bloom filter's false positive rate rises as it fills up. At 50% fill, a 3-hash filter approaches 10% false positive rate — meaning 1 in 10 genuinely new URLs gets skipped. The filter becomes too aggressive at high fill factors.

→ Fix: Monitor fill factor — rebuild/rotate filter when exceeding 40% capacity. Use counting Bloom filter variant for deletions.
09

Interview Tips

01

Start with the core loop, then identify what breaks. Most candidates jump straight to distributed architecture. Instead, describe the simple single-machine loop first, then say "here's why each part breaks at scale" — this shows structured thinking rather than pattern-matching.

02

Mention politeness early and unprompted. This is the trap most candidates miss. Interviewers specifically watch for whether you bring up robots.txt and crawl delays without being asked. It signals you understand the web is a shared resource, not just a target to scrape.

03

The Bloom filter question is almost guaranteed. Know it cold: what it does, why it's used instead of a hash set, what a false positive means in this context, and why false positives are acceptable but false negatives are not. The asymmetry is the key insight.

04

Distinguish fetcher scale from parser scale. These are I/O-bound and CPU-bound respectively. Scaling them separately is a real engineering decision — interviewers love when candidates identify that two components have different bottlenecks and should scale independently with Kafka decoupling them.

05

Common follow-up: "How do you handle JavaScript-heavy pages?" Modern SPAs don't render content in raw HTML — you'd need a headless browser (Puppeteer, Playwright) in your fetcher fleet, which is 10–100× slower than a plain HTTP fetch. Answer: tiered crawling — fast HTTP fetch first, headless browser only for pages that need it (detected by minimal HTML content).

06

Name the spider trap problem before being asked. Say: "One failure mode I want to address proactively is spider traps — infinite URL generation by a single domain." Then explain domain budgets and depth limits. Shows you understand adversarial inputs, not just happy-path architecture.

11

How the Design Evolves

Phase 1 — Single Machine
The Naive Loop

One Python script. Sequential fetches. In-memory set for dedup. Fine for crawling a single site or a few thousand pages. Hits a wall at ~172,000 pages/day. This is the right starting point — understand what breaks before reaching for distributed complexity.

Phase 2 — Single Machine, Async
Async I/O — 100× Throughput

Switch from sequential to async (Python asyncio, Node.js). Thousands of concurrent connections on one machine. Move dedup set to Redis so it survives restarts. Add robots.txt caching. Get to ~200,000 pages/hour. Most small-scale crawlers live here.

Phase 3 — Multi-Machine
Distributed Fetcher Fleet + Central Frontier

Introduce the URL Frontier as a central service. Multiple fetcher machines with domain affinity. Bloom filter replaces Redis set for memory efficiency. Separate fetching and parsing with Kafka. DNS cache service. Now reaching millions of pages/hour.

Phase 4 — Web Scale
SimHash, PageRank Feedback, Multi-Region

Add near-duplicate detection via SimHash. Feed link graph back to PageRank to improve frontier prioritization. Multi-region deployment for latency and redundancy. Tiered crawling — fast HTTP path + slower headless browser path for JS-heavy pages. This is Google / Bing territory — tens of thousands of machines, petabytes of storage.

Phase 5 — Specialization
Domain-Specific Crawlers

Separate crawl pipelines for different content types: news (freshness over coverage), e-commerce (structured data extraction), academic papers (PDF parsing), social media (API access over crawl). Each has a different freshness requirement, depth strategy, and parsing logic. The general-purpose crawler architecture becomes a platform others build on.

Next up