Google Maps

01

Problem Statement

Build a mapping service used by over a billion people. Users expect to see a zoomable, pannable map of the entire planet, search for any address or business, and get directions — all with sub-second latency. Behind the scenes, the system must ingest GPS data from hundreds of millions of devices, match those noisy signals to actual roads, aggregate them into live traffic speeds, and feed those speeds back into the routing engine so that every ETA reflects conditions right now.

This is three systems in a trenchcoat: a tile rendering pipeline (read-heavy, CDN-first), a routing engine (CPU-heavy, graph algorithms), and a real-time traffic pipeline (streaming, stateful). Each has different scaling characteristics.

Core question: How do you compute optimal routes across a billion-node road graph in milliseconds, while continuously updating edge weights from millions of live GPS probes?

02

Requirements

Functional Requirements

Map display — Render interactive, pannable, zoomable maps at multiple detail levels
Geocoding & search — Search for addresses, places, and businesses; convert between coordinates and human-readable addresses
Routing — Compute optimal route between two or more waypoints by car, transit, walking, or cycling
Live traffic — Display real-time traffic conditions as a colored overlay (red/yellow/green)
Navigation — Provide turn-by-turn directions with live ETA updates as conditions change

Non-Functional Requirements

Tile latency < 200ms — Anything slower feels laggy when panning
Routing latency < 1s — Users expect near-instant directions
Traffic freshness < 90s — GPS probe to traffic overlay in under 90 seconds
99.99% availability — People depend on this while driving
Global coverage — Every road, every country

03

Scale Estimation

Let's derive the numbers from reasonable assumptions rather than memorizing them.

~150K/s

Tile requests

~350/s

Routing queries

~20M/s

GPS probes ingested

~30 GB

Road graph in memory

Derivation

Tiles: ~1B MAU, ~300M DAU. Each session loads ~40 tiles from panning/zooming. 300M × 40 = 12B tiles/day → ~150K/sec. CDN absorbs 95%+, so origin sees ~7K/sec.

Routing: ~10% of sessions request directions → 30M/day → ~350 queries/sec. Modest compared to tiles.

GPS probes: ~1B Android devices, ~5-10% actively driving → 50–100M concurrent. At 1 probe per 3 seconds: ~15–30M probes/sec. This is the ingestion challenge.

Road graph: ~1B nodes (intersections), ~2B edges (road segments). With efficient encoding (adjacency arrays, 4 bytes per node ID, 4 bytes per weight): ~30–50 GB. Fits in RAM on a single server.

Key insight: This system is overwhelmingly read-heavy and tile-dominated. Tiles are the scalability challenge (CDN + caching). Routing is the algorithmic challenge (graph algorithms). Traffic is the streaming challenge (Kafka + Flink). Three separate problems, three separate solutions.

04

API Design

Get Map Tiles

        GET /tiles/{z}/{x}/{y}.pbf

Response: Protocol Buffer encoded vector tile
Headers:  Cache-Control: public, max-age=86400
          Content-Encoding: gzip

z = zoom level (0–20)
x = column index
y = row index
      

Get Directions

        GET /directions?origin={lat,lng}
               &destination={lat,lng}
               &mode={driving|walking|cycling|transit}
               &departure_time={unix_ts}
               &alternatives={true|false}

Response: {
  routes: [{
    distance_m:    42300,
    duration_s:    2580,
    duration_traffic_s: 3120,
    polyline:      "encoded_polyline",
    steps:         [{ instruction, distance, duration, maneuver }],
    traffic_speed_segments: [{ start_idx, end_idx, speed_category }]
  }]
}
      

Geocoding / Search

        GET /geocode?address={query}&bounds={sw_lat,sw_lng,ne_lat,ne_lng}

Response: {
  results: [{
    place_id:      "ChIJ...",
    formatted:     "Dubai Mall, Dubai, UAE",
    location:      { lat: 25.1972, lng: 55.2796 },
    types:         ["shopping_mall", "point_of_interest"]
  }]
}
      

Ingest GPS Probe (device → server)

        POST /probes  (batched, every 5-15 seconds)

Body: {
  device_id:  "hashed_anon_id",
  probes: [
    { lat, lng, speed_mps, heading_deg, accuracy_m, ts },
    { lat, lng, speed_mps, heading_deg, accuracy_m, ts },
    ...
  ]
}
      

05

High-Level Architecture

Three subsystems with different scaling profiles: a CDN-heavy tile pipeline, an in-memory routing engine, and a streaming traffic pipeline. They share the spatial database but are otherwise decoupled.

The red pipeline at the bottom is the real-time traffic system: GPS probes flow in from phones, get map-matched to road segments, aggregated into per-segment speeds, and fed back into the routing engine via CRP metric updates. The dashed amber line shows this feedback loop — updated edge weights flow from the aggregator back into the in-memory road graph.

Request Flow — Step Through

Client→CDN / LB→Routing Engine→Road Graph→Traffic Cache→Response

Click Next Step to walk through the request flow.

06

Deep Dive — Contraction Hierarchies

The routing engine's core algorithm. Dijkstra's algorithm works for city-scale graphs, but for a cross-country route (Dubai to Istanbul), it would explore millions of nodes — seconds of compute per query. We need something that exploits the structure of road networks. Enter Contraction Hierarchies (CH).

Core insight: For long-distance routes, you always use important roads. Nobody drives from Dubai to Abu Dhabi via residential streets. If the algorithm "knows" this, it can skip millions of irrelevant nodes.

The Algorithm in Two Phases

CH works in two phases: an expensive offline preprocessing step (run once, takes hours) and a blazing fast online query step (runs per request, takes milliseconds).

Phase 1 — Preprocessing: Node Contraction

Rank every node by "importance." A highway interchange serving 50,000 cars/day is important. A dead-end residential cul-de-sac is not. The ranking uses heuristics: how many shortcuts would contracting this node create? How many neighbors does it have? How deep in the hierarchy are its neighbors?

Then, contract nodes from least important to most important. "Contracting" a node means: remove it from the graph, and for every pair of its neighbors (u, v), check if the shortest path from u to v went through this node. If yes, add a shortcut edge u→v with weight = weight(u→node) + weight(node→v).

Contraction Walkthrough

Click Next Step to see how nodes get contracted and shortcut edges appear.

Phase 2 — Query: Bidirectional Upward Search

To find the shortest path from S to T, run two simultaneous Dijkstra searches:

Forward search from S — but only explore edges that go upward in the hierarchy (to more important nodes).
Backward search from T — same rule, only go upward.

The two searches meet at some high-importance node in the middle — typically a highway node. The shortest path passes through this meeting point. Because both searches only go upward, they each explore very few nodes — typically ~500 nodes per direction, for a total of about 1,000 node relaxations. Compare that to millions for plain Dijkstra.

sequenceDiagram participant C as Client participant LB as Load Balancer participant R as Routing Engine participant G as Road Graph (RAM) participant T as Traffic Cache C->>LB: GET /directions?origin=...&dest=... LB->>R: Forward to routing shard R->>G: Load CH graph for region R->>T: Fetch current edge weights (CRP) Note over R: Bidirectional CH search
Forward from origin ↑
Backward from dest ↑
Meet at highway node R->>R: Unpack shortcut edges → full path R->>R: Compute ETA from traffic weights R-->>LB: Route + polyline + ETA LB-->>C: JSON response

Unpacking Shortcuts

The path found by the query uses shortcut edges. To get the actual turn-by-turn directions, you unpack each shortcut recursively: a shortcut A→C was created when B was contracted, so it represents A→B→C. Shortcut B→D might unpack to B→X→D, and so on until you have the full path with every intersection.

Handling Live Traffic — CRP Integration

Standard CH has a problem: the preprocessing assumes static edge weights. If traffic changes, you'd need to re-preprocess the entire graph — which takes hours. Customizable Route Planning (CRP) solves this by separating concerns:

Metric-independent preprocessing: Partition the graph into cells based on topology. Precompute which boundary nodes connect cells. This depends on road topology only, not travel times. Run once.
Metric customization: When traffic changes, update edge weights and recompute only the cross-cell boundary distances. This takes seconds, not hours.

The real-time traffic pipeline (GPS probes → map matching → aggregation) produces updated segment speeds every 30–60 seconds. These feed into the CRP customization step, which updates the routing graph's edge weights in near real-time. The feedback loop closes: live traffic → updated weights → better ETAs.

Performance: CH preprocessing takes ~2–4 hours for a continent-sized graph. But queries run in 2–5 milliseconds, examining ~1,000 nodes instead of millions. With CRP, traffic updates apply in seconds without re-preprocessing.

07

One GPS glitch reporting 0 km/h on a highway drags the average down, creating phantom congestion. Mean is too sensitive to outliers for this use case.

08

What Can Go Wrong

Phantom Traffic Jams

A few phones in a parking lot adjacent to a highway report low speeds. The aggregation pipeline mistakenly attributes these to the highway, showing false congestion. Mitigation: Outlier filtering (discard probes >2σ from running median), minimum probe count thresholds, and cross-validation with adjacent segments.

GPS Drift in Tunnels & Urban Canyons

GPS accuracy degrades to 50–100m between tall buildings. Phones "jump" when exiting tunnels. Map matching assigns probes to wrong roads. Mitigation: HMM-based map matching considers trajectory continuity. Known tunnel geometries are pre-loaded. Low-accuracy probes are downweighted.

Traffic Pipeline Cold Start After Outage

If the streaming pipeline (Flink) restarts, all in-memory map-matching state and sliding window aggregations are lost. Roads temporarily show no live traffic data. Mitigation: Fall back to historical speed profiles (this road at this time-of-day). Flink checkpoints state to durable storage for faster recovery. Busy roads rebuild in ~5 minutes; quiet roads take longer.

Routing Graph Staleness During CRP Update

While CRP re-customizes edge weights (takes seconds), queries still run on the old weights. A sudden accident may not be reflected immediately. Mitigation: Double-buffer the graph — build updated weights on a shadow copy, then atomically swap. Incident detection can bypass the window for critical changes.

CDN Cache Stampede on Map Data Update

When new map data is published (weekly), all cached tiles are invalidated simultaneously, causing a thundering herd of origin requests. Mitigation: Staggered invalidation by region/zoom level. Jittered cache TTLs. Origin servers auto-scale ahead of scheduled publishes.

Sparse Data on Rural Roads

A back road with one probe per hour has terrible speed estimates. Routing may overestimate or underestimate travel time. Mitigation: Bayesian estimation with a strong prior from the road's speed limit and historical data. Confidence intervals are wider; ETA range shown to user is broader.

09

Interview Tips

💡

Scope aggressively.
Maps is three systems in one. Immediately tell the interviewer: "This has tile rendering, routing, and live traffic. I'd like to cover all three at a high level, then go deep on one. Which interests you most?" This shows maturity and lets you avoid a shallow skim of everything.

⚡

Derive the numbers.
Don't memorize "150K tile requests/sec." Instead: "1B MAU → 300M DAU → 40 tiles per session → 12B tiles/day → ~150K/sec." Interviewers are testing your ability to reason, not recall. Show the math.

🎯

Name-drop Contraction Hierarchies.
Most candidates say "Dijkstra" and stop. Knowing CH and explaining why Dijkstra isn't enough at scale is a strong differentiator. You don't need to implement it — just explain the two-phase approach: preprocess once (contract unimportant nodes, add shortcuts), query fast (bidirectional upward search).

🗺️

The tile pyramid is a CDN problem.
Once you explain the (z, x, y) addressing scheme and that tiles are static assets, the scaling story is straightforward: CDN caching with 95%+ hit rate. Don't over-complicate it. Spend your interview time on the harder parts — routing and traffic.

🔄

Close the feedback loop.
The most impressive thing you can draw is the full cycle: GPS probes → map matching → speed aggregation → CRP edge weight update → better ETAs → shown to user → user drives → generates more GPS probes. Show the interviewer you understand this is a system, not just components.

10

Evolution

How this design grows from MVP to planet-scale.

1

Single-City MVP

One city, raster tiles, A* routing on a small graph in memory, no live traffic. Tiles pre-rendered and served from a single server with Nginx. Graph fits in <1 GB. Good enough for a demo.

2

Country-Scale: Add CH + CDN

Switch to vector tiles for flexibility. Put tiles behind a CDN. Implement Contraction Hierarchies for fast cross-country routing. Add geocoding with Elasticsearch. This handles millions of users in one country.

3

Add Real-Time Traffic Pipeline

Ingest GPS probes via Kafka. Build the map-matching + aggregation pipeline with Flink. Overlay live traffic on tiles. Integrate CRP so routing reflects current conditions. ETA accuracy jumps dramatically.

4

Planet-Scale: Sharding + Multi-Region

Partition the road graph by continent/region. Deploy routing engines in each region. Multi-region CDN with edge caching. Cross-region routes stitch partial results. Handle 1B+ users across all countries.

5

Predictive Intelligence

ML models for traffic prediction (not just current conditions, but forecasts). Incident detection from speed anomalies. Rerouting suggestions pushed proactively. Integrate transit, cycling, walking multimodal routing. Indoor maps for airports and malls.

📺

References & Videos

Design Google Maps

ByteByteGo · 11 min

Google Maps System Design

Problem Statement

Requirements

Functional Requirements

Non-Functional Requirements

Scale Estimation

Derivation

API Design

High-Level Architecture

Deep Dive — Contraction Hierarchies

The Algorithm in Two Phases

Phase 1 — Preprocessing: Node Contraction

Phase 2 — Query: Bidirectional Upward Search

Unpacking Shortcuts

Handling Live Traffic — CRP Integration

Key Design Decisions & Tradeoffs

Vector Tiles vs Raster Tiles

Vector Tiles (Protobuf)

Raster Tiles (PNG)

Contraction Hierarchies vs Plain Dijkstra / A*

Contraction Hierarchies + CRP

A* (with heuristic)

Road Graph Storage: In-Memory vs On-Disk

In-Memory (~30–50 GB)

On-Disk (memory-mapped)

Tile Pre-rendering Strategy

Hybrid: Pre-render z0–14, On-Demand z15+

Fully On-Demand

Traffic Aggregation: Median vs Mean

Weighted Median + Time Decay

Simple Mean

What Can Go Wrong

Phantom Traffic Jams

GPS Drift in Tunnels & Urban Canyons

Traffic Pipeline Cold Start After Outage

Routing Graph Staleness During CRP Update

CDN Cache Stampede on Map Data Update

Sparse Data on Rural Roads

Interview Tips

Similar Problems

Uber / Ride Sharing

Notification System

Distributed Message Queue

Evolution

Single-City MVP

Country-Scale: Add CH + CDN

Add Real-Time Traffic Pipeline

Planet-Scale: Sharding + Multi-Region

Predictive Intelligence

References & Videos

Uber / Ride Sharing

Notification System

DNS