A three-sided marketplace: eaters order, merchants prepare, Dashers deliver. The hard parts:
a dispatch engine that assigns the right Dasher to the right order in real-time optimizing for ETA, efficiency, and driver earnings; a live ETA system updating predictions every few seconds based on traffic, restaurant prep time, and Dasher position;
and a transactional order lifecycle that doesn't lose your $47 burrito when the payment processor burps. DoorDash does ~2B orders/year; ~6M orders/day.
6M/day × ~10× peak-to-avg; concentrated 6–8 pm local time
Active Dashers / sec
~300K
sending location pings every 4 s = ~75K pings/sec baseline
Geospatial queries / sec
~10K
"nearby Dashers", "nearby restaurants" — hot QPS on the geo index
Dispatch latency budget
~3 s
eater waits for "assigning Dasher"; longer feels broken
Order storage
~60 GB / day
6M orders × ~10 KB each incl. line items + events; ~22 TB/yr hot
Live tracking sessions
~500K
concurrent during dinner peak; each gets Dasher location pushed every ~4 s
04
API Design
GET/restaurants?lat=X&lng=Y&radius=3km
Geo-indexed restaurant search. Returns {restaurants: [{id, name, cuisine, eta_range, fee, rating}], hot_tiles}. Served from a read replica with restaurants clustered by H3/geohash cell.
POST/orders
Create order with idempotency key. Body: {restaurant_id, items[], address, payment_token, tip_amount}. Returns {order_id, total, estimated_eta, status: 'pending_restaurant'}. Strongly consistent; wrapped in a transaction with payment auth.
GET/orders/{id}/track
Live order status + Dasher location. Eater clients poll this every 4 s (or WebSocket push). Returns current {state, dasher_location, eta, elapsed}.
POST/dasher/location
Dasher app posts GPS ping every 4 s when active. Goes to the geo index + the live-tracking push bus. Fire-and-forget; retry on failure.
POST/dasher/offers/{id}/respond
Dasher accepts/declines a dispatched offer. Response window is ~30 s; server auto-rejects and re-dispatches after timeout.
POST/merchants/{id}/orders/{oid}/prep-time
Restaurant POS integration or tablet app updates ready-by-time. Critical signal for dispatch timing.
POST/orders/{id}/tips
Post-delivery tip modification. Hits payment service; updates order ledger. Idempotent.
05
Architecture
Three primary domains: Marketplace (restaurants, menus, browse), Order (checkout, lifecycle, payments), and Logistics (dispatch, ETAs, live tracking). Each is an independent service with its own datastore. They communicate via events on Kafka.
Dispatch is the heart of the system. An order ready (or soon-to-be ready) needs matching to a Dasher with the best tradeoff across: pickup distance, estimated total trip time, driver earnings fairness, batching potential (combine with nearby orders), and restaurant prep synchronization. This is a combinatorial assignment problem at scale.
Naive approach: nearest Dasher wins. Good enough for small markets, terrible at scale — you'd leave Dashers idle on one side of the map while orders pile up on the other.
Modern dispatch algorithm (simplified):
Candidate generation. Query H3 geo-index for Dashers within ~3 km of restaurant. Typically 10–50 candidates.
Feature extraction. For each (Dasher, order) pair compute: pickup distance, Dasher current queue length, restaurant prep time, historical Dasher acceptance rate, current traffic on route.
Score each pair with an ML model predicting "delivery quality" — likelihood of on-time + Dasher acceptance + CS-call-free outcome.
Solve as assignment problem. For N open orders × M candidate Dashers, run the Hungarian algorithm (or approximation) to maximize total score subject to "each Dasher gets at most 1 order." Batch every ~2 s to give multiple orders a chance to co-optimize.
Offer to winning Dasher. Send push notification; 30 s response window. If declined, re-run assignment with remaining candidates.
Consider batching. Two orders at the same restaurant or two orders on the same route → offer both to one Dasher; cut cost per delivery ~30%.
Dispatch Flow — Order Placed → Dasher AssignedMermaid
sequenceDiagram
participant E as Eater
participant O as Order svc
participant M as Merchant
participant D as Dispatch
participant GEO as Location svc
participant DSR as Dasher
E->>O: POST /orders
O->>M: forward + prep-time request
M-->>O: accept + 12 min
O->>D: OrderReady @ T+10 min event
D->>GEO: nearby Dashers within 3 km
GEO-->>D: candidates [D1..D25]
Note over D: score candidates ML + Hungarian
D->>DSR: offer to best Dasher
alt accepted
DSR-->>D: accept
D->>O: AssignmentCommitted
O-->>E: ETA + Dasher info
else declined / timeout 30s
D->>D: remove + reassign
end
Why not run dispatch on every order immediately? Because the optimal assignment often looks 60–120 s into the future. Running a batched window lets "order A places at T, order B places at T+30 s near the same restaurant" be offered as a batch. Pure first-come-first-served leaves money on the table.
Location service. Every Dasher pings location every ~4 s. The service maintains a live H3 geo-index: H3 cell → set of Dasher IDs. Lookups are ring queries — "give me Dashers in this cell and its 18 neighbors." Stored in Redis for speed; bulk location history archived to Cassandra for ML training.
ETA. A separate ML service with features: distance, time-of-day traffic, historical restaurant prep time, Dasher speed profile, weather. Runs a GBDT model every time an eater loads the tracking page. P90 error target: ~3 minutes.
Interview answer
"Dispatch runs in ~2 s windows as a batched assignment problem. Query H3 geo-index for Dashers within 3 km, score each (order, Dasher) pair with an ML model on distance + prep time + historical signals, solve with Hungarian algorithm to maximize global score. Batch two nearby orders to one Dasher when possible. Offer sent via push with 30 s response window; on reject, re-dispatch. ETA served by a separate GBDT model refreshed on every tracking-page open. Location service uses H3 cells + ring queries, hot data in Redis, history in Cassandra."
07
Tradeoffs & Design Choices
Batched dispatch vs instant. Batching (every 2 s) improves global efficiency but adds latency per order. Instant feels faster but is locally greedy and wasteful. The 2-second window is the sweet spot — imperceptible to eater, big marketplace-efficiency win.
Order-as-saga, not a single transaction. Create order → auth payment → confirm with restaurant → dispatch Dasher — each step can fail. Saga pattern with compensating actions (auto-refund on restaurant decline, re-dispatch on Dasher timeout) beats trying to do it as one distributed transaction.
ETA accuracy vs pessimism. A promised 25 min ETA delivered in 23 min delights the user; a promised 25 min delivered in 28 min makes them angry even though it's "close." ETAs are intentionally padded ~2 min for this UX asymmetry.
H3 cells vs quadtree vs geohash. H3 (Uber's hex grid) beats square grids because hex neighbors are all equidistant, simplifying ring queries. Geohash has edge-of-cell artifacts. DoorDash adopted H3 (originally Uber's library) for dispatch lookups.
Strong consistency where it matters. Order state transitions are sagas with idempotent steps in Postgres; payments are 2-phase (auth + capture); dispatch assignments are Cassandra-style writes with dedup. Browse/search can be eventually consistent (cached menus are fine for 60 s).
08
Failure Modes
🍔
Restaurant accepts, then realizes it's out of a dish
Eater waits 10 min; then gets "item unavailable" and has to pick a substitute or cancel. Broken promise.
→ Mitigation: menu inventory integration with POS so out-of-stock items are auto-hidden; pre-order availability check before confirm; in-app substitution flow with partial refund.
🚗
Dasher GPS ping goes dark mid-delivery
Tunnel, dead battery, bad signal → eater's tracking map freezes; CS complaints spike.
→ Mitigation: client-side UI gracefully shows "Dasher is approaching" without exact position after N seconds of stale data. Don't show the last-known dot forever — it's creepy and misleading.
💳
Payment auth fails after order committed
Order saved; payment processor declines. Restaurant started cooking.
→ Mitigation: payment auth is the first step of the saga, not last. No order state written before auth success. If capture fails later (rare), order proceeds and we collect on retry or debt channel.
→ Mitigation: surge pricing incentives to Dashers; ETA inflation visible to eater at checkout so expectations match reality; temporary caps on ordering in the most impacted areas to prevent total collapse.
🧨
Dispatch service regional outage
Dispatch goes down in a region. Orders pile up in "pending assignment"; restaurants cook food with no one to deliver it.
→ Mitigation: automatic fallback to nearest-Dasher heuristic in-region (no ML model required); circuit-breaker from Order svc to Dispatch so eaters see realistic ETA extensions; graceful degradation better than full failure.
🔁
Duplicate order on network retry
Eater taps "place order" twice during a spinner; phone retries on 504 → two orders.
→ Mitigation: client-generated idempotency key per tap; Order service rejects duplicate keys within 10 minutes; charge once.
✓ Better: 2-second batched window; Hungarian assignment with batching bonus.
🚫
Poll for order status every second
Millions of clients × 1/sec = DDoS on your own API.
✓ Better: WebSocket push or long-poll; status delta events; client pulls on wake.
🚫
One `orders` table with everything
Merchant / Dasher / Rider / Payments all fight over this hot row.
✓ Better: Split into Marketplace / Order / Logistics / Payment services with their own stores.
09
Interview Tips
Lead with dispatch, not with search. The marketplace is boring; dispatch is where the interesting engineering is. Interviewers expect you to pick dispatch as the deep-dive.
Explicitly call out "three-sided." Eaters, merchants, Dashers each have their own app, their own SLA, their own failure modes. Many candidates forget the merchant surface entirely.
Saga over distributed transactions. Don't try to "2PC across restaurant + payment + dispatch." Saga with idempotent compensations is the production pattern.
H3 over geohash. Mention H3 by name — Uber popularized it, DoorDash uses it, Yelp uses it. Shows domain fluency.
ETA is ML, not "nearest(restaurant, address)/avg_speed." Frame it as a model trained on millions of historical deliveries. Distance is one feature among many.