Design a global mapping platform that renders interactive maps, computes optimal routes in milliseconds, and ingests real-time GPS data from millions of phones to estimate live traffic and ETAs.
Build a mapping service used by over a billion people. Users expect to see a zoomable, pannable map of the entire planet, search for any address or business, and get directions — all with sub-second latency. Behind the scenes, the system must ingest GPS data from hundreds of millions of devices, match those noisy signals to actual roads, aggregate them into live traffic speeds, and feed those speeds back into the routing engine so that every ETA reflects conditions right now.
This is three systems in a trenchcoat: a tile rendering pipeline (read-heavy, CDN-first), a routing engine (CPU-heavy, graph algorithms), and a real-time traffic pipeline (streaming, stateful). Each has different scaling characteristics.
Core question: How do you compute optimal routes across a billion-node road graph in milliseconds, while continuously updating edge weights from millions of live GPS probes?
Traffic freshness < 90s — GPS probe to traffic overlay in under 90 seconds
99.99% availability — People depend on this while driving
Global coverage — Every road, every country
03
Scale Estimation
Let's derive the numbers from reasonable assumptions rather than memorizing them.
~150K/s
Tile requests
~350/s
Routing queries
~20M/s
GPS probes ingested
~30 GB
Road graph in memory
Derivation
Tiles: ~1B MAU, ~300M DAU. Each session loads ~40 tiles from panning/zooming.
300M × 40 = 12B tiles/day → ~150K/sec. CDN absorbs 95%+, so origin sees ~7K/sec.
Routing: ~10% of sessions request directions → 30M/day → ~350 queries/sec. Modest compared to tiles.
GPS probes: ~1B Android devices, ~5-10% actively driving → 50–100M concurrent. At 1 probe per 3 seconds: ~15–30M probes/sec. This is the ingestion challenge.
Road graph: ~1B nodes (intersections), ~2B edges (road segments). With efficient encoding (adjacency arrays, 4 bytes per node ID, 4 bytes per weight): ~30–50 GB. Fits in RAM on a single server.
Key insight: This system is overwhelmingly read-heavy and tile-dominated. Tiles are the scalability challenge (CDN + caching). Routing is the algorithmic challenge (graph algorithms). Traffic is the streaming challenge (Kafka + Flink). Three separate problems, three separate solutions.
04
API Design
Get Map Tiles
GET /tiles/{z}/{x}/{y}.pbf
Response: Protocol Buffer encoded vector tile
Headers: Cache-Control: public, max-age=86400
Content-Encoding: gzip
z = zoom level (0–20)
x = column index
y = row index
POST /probes (batched, every 5-15 seconds)
Body: {
device_id: "hashed_anon_id",
probes: [
{ lat, lng, speed_mps, heading_deg, accuracy_m, ts },
{ lat, lng, speed_mps, heading_deg, accuracy_m, ts },
...
]
}
05
High-Level Architecture
Three subsystems with different scaling profiles: a CDN-heavy tile pipeline, an in-memory routing engine, and a streaming traffic pipeline. They share the spatial database but are otherwise decoupled.
The red pipeline at the bottom is the real-time traffic system: GPS probes flow in from phones, get map-matched to road segments, aggregated into per-segment speeds, and fed back into the routing engine via CRP metric updates. The dashed amber line shows this feedback loop — updated edge weights flow from the aggregator back into the in-memory road graph.
The routing engine's core algorithm. Dijkstra's algorithm works for city-scale graphs, but for a cross-country route (Dubai to Istanbul), it would explore millions of nodes — seconds of compute per query. We need something that exploits the structure of road networks. Enter Contraction Hierarchies (CH).
Core insight: For long-distance routes, you always use important roads. Nobody drives from Dubai to Abu Dhabi via residential streets. If the algorithm "knows" this, it can skip millions of irrelevant nodes.
The Algorithm in Two Phases
CH works in two phases: an expensive offline preprocessing step (run once, takes hours) and a blazing fast online query step (runs per request, takes milliseconds).
Phase 1 — Preprocessing: Node Contraction
Rank every node by "importance." A highway interchange serving 50,000 cars/day is important. A dead-end residential cul-de-sac is not. The ranking uses heuristics: how many shortcuts would contracting this node create? How many neighbors does it have? How deep in the hierarchy are its neighbors?
Then, contract nodes from least important to most important. "Contracting" a node means: remove it from the graph, and for every pair of its neighbors (u, v), check if the shortest path from u to v went through this node. If yes, add a shortcut edge u→v with weight = weight(u→node) + weight(node→v).
Contraction Walkthrough
Click Next Step to see how nodes get contracted and shortcut edges appear.
Phase 2 — Query: Bidirectional Upward Search
To find the shortest path from S to T, run two simultaneous Dijkstra searches:
Forward search from S — but only explore edges that go upward in the hierarchy (to more important nodes).
Backward search from T — same rule, only go upward.
The two searches meet at some high-importance node in the middle — typically a highway node. The shortest path passes through this meeting point. Because both searches only go upward, they each explore very few nodes — typically ~500 nodes per direction, for a total of about 1,000 node relaxations. Compare that to millions for plain Dijkstra.
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant R as Routing Engine
participant G as Road Graph (RAM)
participant T as Traffic Cache
C->>LB: GET /directions?origin=...&dest=...
LB->>R: Forward to routing shard
R->>G: Load CH graph for region
R->>T: Fetch current edge weights (CRP)
Note over R: Bidirectional CH search Forward from origin ↑ Backward from dest ↑ Meet at highway node
R->>R: Unpack shortcut edges → full path
R->>R: Compute ETA from traffic weights
R-->>LB: Route + polyline + ETA
LB-->>C: JSON response
Unpacking Shortcuts
The path found by the query uses shortcut edges. To get the actual turn-by-turn directions, you unpack each shortcut recursively: a shortcut A→C was created when B was contracted, so it represents A→B→C. Shortcut B→D might unpack to B→X→D, and so on until you have the full path with every intersection.
Handling Live Traffic — CRP Integration
Standard CH has a problem: the preprocessing assumes static edge weights. If traffic changes, you'd need to re-preprocess the entire graph — which takes hours. Customizable Route Planning (CRP) solves this by separating concerns:
Metric-independent preprocessing: Partition the graph into cells based on topology. Precompute which boundary nodes connect cells. This depends on road topology only, not travel times. Run once.
Metric customization: When traffic changes, update edge weights and recompute only the cross-cell boundary distances. This takes seconds, not hours.
The real-time traffic pipeline (GPS probes → map matching → aggregation) produces updated segment speeds every 30–60 seconds. These feed into the CRP customization step, which updates the routing graph's edge weights in near real-time. The feedback loop closes: live traffic → updated weights → better ETAs.
Performance: CH preprocessing takes ~2–4 hours for a continent-sized graph. But queries run in 2–5 milliseconds, examining ~1,000 nodes instead of millions. With CRP, traffic updates apply in seconds without re-preprocessing.
07
Key Design Decisions & Tradeoffs
Vector Tiles vs Raster Tiles
✓ Chosen
Vector Tiles (Protobuf)
Send geometry data (lines, polygons, points); client GPU renders. Smaller payloads (20–50 KB), supports rotation/3D/restyling, decoupled from display resolution.
✗ Alternative
Raster Tiles (PNG)
Pre-rendered images (50–200 KB). Simpler clients, but no rotation/3D, wasteful bandwidth, and you must re-render for any style change. Legacy approach.
Contraction Hierarchies vs Plain Dijkstra / A*
✓ Chosen
Contraction Hierarchies + CRP
Hours of preprocessing, but queries in 2–5ms exploring ~1,000 nodes. CRP layer allows live traffic updates in seconds. Dominates at scale.
✗ Alternative
A* (with heuristic)
No preprocessing needed, simpler to implement. But cross-country routes explore millions of nodes — seconds per query. Fine for small graphs, not planet-scale.
Road Graph Storage: In-Memory vs On-Disk
✓ Chosen
In-Memory (~30–50 GB)
Graph fits in RAM. Sub-millisecond node access. Required for 2–5ms query latency. Higher server cost, but routing is only ~350 QPS — a few beefy servers suffice.
✗ Alternative
On-Disk (memory-mapped)
Lower memory footprint, but random access latency jumps from nanoseconds to microseconds. Routing queries would take 10–100× longer. Only viable if graph doesn't fit in RAM.
Tile Pre-rendering Strategy
✓ Chosen
Hybrid: Pre-render z0–14, On-Demand z15+
Pre-render ~268M tiles for common zoom levels. Deep zooms and rural areas rendered on first request, then cached. Balances storage (petabytes saved) with latency.
✗ Alternative
Fully On-Demand
No pre-rendering, everything generated live. Simpler pipeline, but first-load latency spikes for popular areas. CDN cache-miss storms during peak hours.
Traffic Aggregation: Median vs Mean
✓ Chosen
Weighted Median + Time Decay
Robust to outliers (GPS glitches, parking-lot probes adjacent to highways). Recent probes weighted higher. Falls back to historical prior when data is sparse.
✗ Alternative
Simple Mean
One GPS glitch reporting 0 km/h on a highway drags the average down, creating phantom congestion. Mean is too sensitive to outliers for this use case.
08
What Can Go Wrong
Phantom Traffic Jams
A few phones in a parking lot adjacent to a highway report low speeds. The aggregation pipeline mistakenly attributes these to the highway, showing false congestion. Mitigation: Outlier filtering (discard probes >2σ from running median), minimum probe count thresholds, and cross-validation with adjacent segments.
GPS Drift in Tunnels & Urban Canyons
GPS accuracy degrades to 50–100m between tall buildings. Phones "jump" when exiting tunnels. Map matching assigns probes to wrong roads. Mitigation: HMM-based map matching considers trajectory continuity. Known tunnel geometries are pre-loaded. Low-accuracy probes are downweighted.
Traffic Pipeline Cold Start After Outage
If the streaming pipeline (Flink) restarts, all in-memory map-matching state and sliding window aggregations are lost. Roads temporarily show no live traffic data. Mitigation: Fall back to historical speed profiles (this road at this time-of-day). Flink checkpoints state to durable storage for faster recovery. Busy roads rebuild in ~5 minutes; quiet roads take longer.
Routing Graph Staleness During CRP Update
While CRP re-customizes edge weights (takes seconds), queries still run on the old weights. A sudden accident may not be reflected immediately. Mitigation: Double-buffer the graph — build updated weights on a shadow copy, then atomically swap. Incident detection can bypass the window for critical changes.
CDN Cache Stampede on Map Data Update
When new map data is published (weekly), all cached tiles are invalidated simultaneously, causing a thundering herd of origin requests. Mitigation: Staggered invalidation by region/zoom level. Jittered cache TTLs. Origin servers auto-scale ahead of scheduled publishes.
Sparse Data on Rural Roads
A back road with one probe per hour has terrible speed estimates. Routing may overestimate or underestimate travel time. Mitigation: Bayesian estimation with a strong prior from the road's speed limit and historical data. Confidence intervals are wider; ETA range shown to user is broader.
09
Interview Tips
💡
Scope aggressively.
Maps is three systems in one. Immediately tell the interviewer: "This has tile rendering, routing, and live traffic. I'd like to cover all three at a high level, then go deep on one. Which interests you most?" This shows maturity and lets you avoid a shallow skim of everything.
⚡
Derive the numbers.
Don't memorize "150K tile requests/sec." Instead: "1B MAU → 300M DAU → 40 tiles per session → 12B tiles/day → ~150K/sec." Interviewers are testing your ability to reason, not recall. Show the math.
🎯
Name-drop Contraction Hierarchies.
Most candidates say "Dijkstra" and stop. Knowing CH and explaining why Dijkstra isn't enough at scale is a strong differentiator. You don't need to implement it — just explain the two-phase approach: preprocess once (contract unimportant nodes, add shortcuts), query fast (bidirectional upward search).
🗺️
The tile pyramid is a CDN problem.
Once you explain the (z, x, y) addressing scheme and that tiles are static assets, the scaling story is straightforward: CDN caching with 95%+ hit rate. Don't over-complicate it. Spend your interview time on the harder parts — routing and traffic.
🔄
Close the feedback loop.
The most impressive thing you can draw is the full cycle: GPS probes → map matching → speed aggregation → CRP edge weight update → better ETAs → shown to user → user drives → generates more GPS probes. Show the interviewer you understand this is a system, not just components.
One city, raster tiles, A* routing on a small graph in memory, no live traffic. Tiles pre-rendered and served from a single server with Nginx. Graph fits in <1 GB. Good enough for a demo.
2
Country-Scale: Add CH + CDN
Switch to vector tiles for flexibility. Put tiles behind a CDN. Implement Contraction Hierarchies for fast cross-country routing. Add geocoding with Elasticsearch. This handles millions of users in one country.
3
Add Real-Time Traffic Pipeline
Ingest GPS probes via Kafka. Build the map-matching + aggregation pipeline with Flink. Overlay live traffic on tiles. Integrate CRP so routing reflects current conditions. ETA accuracy jumps dramatically.
4
Planet-Scale: Sharding + Multi-Region
Partition the road graph by continent/region. Deploy routing engines in each region. Multi-region CDN with edge caching. Cross-region routes stitch partial results. Handle 1B+ users across all countries.
5
Predictive Intelligence
ML models for traffic prediction (not just current conditions, but forecasts). Incident detection from speed anomalies. Rerouting suggestions pushed proactively. Integrate transit, cycling, walking multimodal routing. Indoor maps for airports and malls.