Yelp / Google Places

A local business discovery platform where users search for nearby restaurants, shops, and services by location, category, and filters. The core challenge is answering "what is near me?" across 100 million listings in under 100ms — a fundamentally two-dimensional indexing problem that standard B-tree databases cannot solve out of the box.

⚡ Core: Geospatial Indexing · 50M DAU · 100M Businesses · Read-heavy · Proximity Search · High Availability
02

Requirements

Functional
  • Search businesses by location + keyword + filters — radius, category, rating, price, open now
  • View business detail — name, address, hours, rating, photos, reviews
  • Write a review with star rating and optional photo upload
  • Add or update a business listing
  • Upload photos attached to a business or review
Non-Functional
  • Search results under 100ms p99 latency
  • Business detail page under 50ms p99 — highly cacheable
  • 99.99% availability on read paths
  • Eventual consistency acceptable — ratings can lag seconds
  • Photos served via CDN — never from origin servers
Key Insight

Yelp is a discovery and reference tool, not a transactional system. People search far more than they write reviews. This means reads are the critical path — every architecture decision should optimise read latency first, and tolerate eventual consistency on the write path.

03

Scale Estimation

Search & business data

Metric | Calculation | Result
Daily Active Users | Industry estimate | 50M
Searches / user / day | ~5 per session | 250M/day
Search RPS (avg) | 250M ÷ 86,400 | ~2,900
Search RPS (peak 3×) | 2,900 × 3 | ~8,700
Business data total | 100M × 2KB per listing | ~200GB

Reviews & photos

Metric | Calculation | Result
Reviews written / day | 50M DAU × 0.1 review/user | 5M/day
Review write RPS | 5M ÷ 86,400 | ~58 RPS
Review storage / year | 5M × 500B × 365 | ~900GB/yr
Photos uploaded / day | 5M reviews × 2 photos avg | 10M/day
Photo storage / year | 10M × 500KB × 365 | ~1.8PB/yr
Three Numbers That Drive Everything

200GB of business data fits entirely in Redis — cache aggressively. Review writes at 58 RPS are trivial — don't over-engineer the write path. Photos at 1.8PB/year must go to object storage — they can never touch your API servers.
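The arithmetic behind these figures can be reproduced in a few lines. This is only a sketch that replays the inputs from the two tables above; the variable names are illustrative, and the unrounded results differ slightly from the rounded table values (for example, peak search traffic comes out just under 8,700 RPS).

```python
# Back-of-envelope estimates; every input comes from the tables above.
SECONDS_PER_DAY = 86_400

dau = 50_000_000
searches_per_day = dau * 5                       # ~5 searches per user per day
avg_search_rps = searches_per_day / SECONDS_PER_DAY
peak_search_rps = avg_search_rps * 3             # 3x peak factor

business_bytes = 100_000_000 * 2_000             # 100M listings x 2KB

reviews_per_day = dau // 10                      # 0.1 review per user per day
review_rps = reviews_per_day / SECONDS_PER_DAY
review_bytes_per_year = reviews_per_day * 500 * 365

photos_per_day = reviews_per_day * 2             # 2 photos per review on average
photo_bytes_per_year = photos_per_day * 500_000 * 365

print(f"search RPS avg / peak : {avg_search_rps:,.0f} / {peak_search_rps:,.0f}")
print(f"business data         : {business_bytes / 1e9:,.0f} GB")
print(f"review write RPS      : {review_rps:,.0f}")
print(f"review storage / year : {review_bytes_per_year / 1e9:,.1f} GB")
print(f"photo storage / year  : {photo_bytes_per_year / 1e15:,.3f} PB")
```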

04

API Design

GET /v1/search (Proximity + filter search)
    // Query params
    ?q="sushi"
    &lat=37.7749
    &lng=-122.4194
    &radius=5
    &category="japanese"
    &rating=4.0
    &price=2
    &open_now=true
    &sort_by="relevance"
    &page=1
    &limit=20

    // Response 200: summary fields + thumbnail only, full detail on second tap
    {
      "results": [{
        "business_id": "abc123",
        "name": "Nobu SF",
        "rating": 4.6,
        "distance_miles": 0.8,
        "is_open": true,
        "thumbnail_url": "https://cdn.yelp.com/photos/abc123/thumb.jpg"
      }],
      "total": 847,
      "page": 1
    }
GET /v1/businesses/:id (Full business detail page)
    // Response 200: full hours, photos, contact info included
    {
      "business_id": "abc123",
      "rating": 4.6,
      "review_count": 2340,
      "hours": { "monday": { "open": "12:00", "close": "22:00" }, ... },
      "timezone": "America/Los_Angeles",
      "photos": ["https://cdn.yelp.com/photos/abc123/1.jpg"]
    }
POST /v1/businesses/:id/reviews (Write a review)
    // Request: photo_ids reference pre-uploaded S3 objects
    {
      "rating": 5,
      "text": "Best omakase in the city...",
      "photo_ids": ["ph001"]
    }

    // Response 201 Created
    {
      "review_id": "rev789",
      "created_at": "2026-03-24T19:45:00Z"
    }
POST /v1/photos/upload-url (Get pre-signed S3 upload URL)
    // Client uploads DIRECTLY to S3: photo bytes never touch API servers
    {
      "photo_id": "ph001",
      "upload_url": "https://s3.amazonaws.com/yelp/ph001?X-Amz-Sig=...",
      "expires_in": 300
    }
Seven Endpoints, Full Product

GET /search · GET /businesses/:id · POST /businesses · PUT /businesses/:id · GET /businesses/:id/reviews · POST /businesses/:id/reviews · POST /photos/upload-url. The search response returns only a compact summary per business (ID, name, rating, distance, open status, thumbnail); full detail is fetched on the second tap, which keeps the search payload small.

05

High-Level Architecture

Architecture: High Level (SVG diagram; components listed below)
  • CLIENT: Browser / Mobile
  • INFRA: Load Balancer
  • COMPUTE: Search Service (Geohash · Filter · Rank)
  • COMPUTE: Business Service (CRUD · Hours · Pre-sign)
  • COMPUTE: Review Service (Write Reviews · Publish Events)
  • CACHE: Redis (Search · Business docs)
  • STORAGE: Business DB (PostgreSQL + Geohash idx)
  • STORAGE: S3 + CDN (Photos · Pre-signed URL)
  • QUEUE: Kafka (review.created events)
  • CONSUMER: Rating Aggregator (Incremental sum/count update)
  • STORAGE: Review DB (PostgreSQL)
  • Legend: Synchronous · Async / cache
Search Service

The heart of Yelp. Converts lat/lng → geohash, queries the 9-cell neighbor grid, applies filters in memory, ranks the candidate set by relevance, and returns results. Reads primarily from Redis — the DB is hit only on a cold cache miss.

Business Service

Handles all CRUD for business listings. Also issues pre-signed S3 URLs for photo uploads — meaning photo bytes never pass through this service. It issues the ticket; S3 and CDN do all the heavy lifting from there.

Kafka + Rating Aggregator

Review writes publish an event to Kafka immediately. The aggregator consumes it and atomically increments total_rating_sum and review_count on the business row. O(1) update regardless of how many reviews exist.

Redis Cache (Two Namespaces)

search:<geohash>:<filters> → business ID list (TTL 5 min). business:<id> → full document (TTL 1 hr). Together these absorb the vast majority of all reads without touching PostgreSQL.
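A minimal sketch of how the two key namespaces might be built, assuming filters arrive as a plain dictionary. The function names and the set of excluded fields are illustrative assumptions, not part of the design above. The point it shows is that keys must be canonical: the same logical search should always produce the same string, whatever order the parameters arrived in.

```python
SEARCH_TTL_SECONDS = 300      # 5 min, as stated above
BUSINESS_TTL_SECONDS = 3600   # 1 hr, as stated above

# Fields that would fragment the cache: raw coordinates are replaced by the
# geohash, and open_now is applied after retrieval. (Assumed set.)
EXCLUDED_FROM_KEY = {"lat", "lng", "open_now"}


def search_cache_key(geohash, filters):
    """Build a stable `search:<geohash>:<filters>` key."""
    parts = []
    for name in sorted(filters):              # sorted => order-independent
        value = filters[name]
        if name in EXCLUDED_FROM_KEY or value is None:
            continue
        parts.append(f"{name}={str(value).strip().lower()}")
    return f"search:{geohash}:{'&'.join(parts)}"


def business_cache_key(business_id):
    """Build the `business:<id>` key for a full business document."""
    return f"business:{business_id}"


print(search_cache_key("9q8yy", {"q": "Sushi", "rating": 4.0, "open_now": True}))
print(business_cache_key("abc123"))
```

Two users in the same cell who search for "Sushi" and "sushi" with the filters in a different order land on the same entry, which is what makes the 5-minute TTL worthwhile.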

06

Deep Dive — Geohashing & the 9-Cell Query

Why This Is The Hard Part

A B-tree index is one-dimensional. Latitude and longitude together describe a two-dimensional point. You can't efficiently satisfy both constraints simultaneously with a standard index — so you need to collapse 2D space into 1D while preserving geographic proximity. That's exactly what geohashing does.

Sequence: Full Search Request Lifecycle (Mermaid.js)
sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant SS as Search Service
    participant RC as Redis Cache
    participant DB as Business DB
    C->>LB: GET /v1/search?q=sushi&lat=37.77&lng=-122.41
    LB->>SS: Route to search service instance
    SS->>SS: lat/lng → geohash 9q8yy
    SS->>SS: Compute 8 neighbor cells
    SS->>RC: GET search:9q8yy:sushi:rating4+
    alt Cache Hit
        RC-->>SS: [business_id list] (under 1ms)
    else Cache Miss
        SS->>DB: SELECT WHERE geohash IN (9 cells)
        DB-->>SS: ~500-2000 candidate businesses
        SS->>SS: Filter by category, rating, price
        SS->>SS: Compute exact distance, discard out-of-radius
        SS->>SS: Rank by relevance + distance score
        SS->>RC: SET search:9q8yy:sushi:rating4+ TTL=300
    end
    SS->>RC: MGET business:abc123, business:def456 ...
    RC-->>SS: Full business documents (TTL 1hr)
    SS-->>C: 200 OK, top 20 results

Geohashing gives every point on Earth a short alphanumeric string where geographic proximity maps to shared string prefixes. Two businesses on the same city block might both start with 9q8y. A restaurant in New York starts with dr5r — a completely different prefix. This makes proximity search a simple prefix lookup, which a B-tree index handles perfectly.

The 9-cell neighbor query exists to fix the one weakness of this approach: if you're standing right on a cell boundary, the nearest business may be just across the line in the adjacent cell. By always querying your home cell plus all 8 surrounding neighbors, you cover every point within one cell width of your position, wherever in the cell you happen to be standing. That guarantee holds only while the search radius is no larger than the cell size, so choose a geohash precision whose cells are at least as wide as the radius.
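The prefix encoding and the 9-cell lookup described above can be sketched in plain Python. This is a minimal illustration of the standard geohash bit-interleaving scheme, not production code; the function names `encode`, `bounds` and `nine_cells` are assumptions made for this example. Neighbours are found here by re-encoding the centre of each adjacent cell, which is simple to follow but not the fastest method.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def encode(lat, lng, precision=5):
    """Encode a point by interleaving longitude and latitude bisection bits."""
    lat_lo, lat_hi, lng_lo, lng_hi = -90.0, 90.0, -180.0, 180.0
    chars, bits, value, use_lng = [], 0, 0, True  # first bit is longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            bit = lng >= mid
            lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lng = not use_lng
        bits += 1
        if bits == 5:                      # 5 bits per base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def bounds(cell):
    """Return (lat_lo, lat_hi, lng_lo, lng_hi) of a geohash cell."""
    lat_lo, lat_hi, lng_lo, lng_hi = -90.0, 90.0, -180.0, 180.0
    use_lng = True
    for ch in cell:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (idx >> shift) & 1
            if use_lng:
                mid = (lng_lo + lng_hi) / 2
                lng_lo, lng_hi = (mid, lng_hi) if bit else (lng_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            use_lng = not use_lng
    return lat_lo, lat_hi, lng_lo, lng_hi


def nine_cells(cell):
    """Home cell plus its 8 neighbours (duplicates are possible at the poles)."""
    lat_lo, lat_hi, lng_lo, lng_hi = bounds(cell)
    height, width = lat_hi - lat_lo, lng_hi - lng_lo
    lat_c, lng_c = (lat_lo + lat_hi) / 2, (lng_lo + lng_hi) / 2
    cells = []
    for d_lat in (-1, 0, 1):
        for d_lng in (-1, 0, 1):
            lat = min(max(lat_c + d_lat * height, -90.0), 90.0)
            lng = (lng_c + d_lng * width + 180.0) % 360.0 - 180.0  # wrap at ±180
            cells.append(encode(lat, lng, len(cell)))
    return cells


home = encode(37.7749, -122.4194, precision=5)   # downtown San Francisco
print(home, nine_cells(home))
```

The resulting list is what would go into the `geohash IN (...)` clause on a cache miss.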

After the geohash query returns a candidate set of 500–2000 businesses, two more steps run in memory: exact distance filtering (compute true haversine distance, discard anything outside the user's actual radius) and ranking (score by a weighted blend of distance, star rating, review count, and keyword relevance). Both steps are cheap because they run on a small candidate set, not across 100M rows.
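The two in-memory steps might look like the sketch below. The haversine formula is standard; the scoring weights, the `keyword_score` field and the candidate dictionary layout are hypothetical choices for illustration, since the text names the ranking signals but not their weights.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def filter_and_rank(candidates, lat, lng, radius_miles, limit=20):
    """Discard candidates outside the radius, then sort by a blended score."""
    scored = []
    for biz in candidates:
        dist = haversine_miles(lat, lng, biz["lat"], biz["lng"])
        if dist > radius_miles:
            continue  # inside the 9-cell grid but outside the true radius
        # Hypothetical weights; each term is normalised to roughly 0..1.
        score = (
            0.4 * (1 - dist / radius_miles)
            + 0.3 * (biz["rating"] / 5.0)
            + 0.1 * min(math.log10(1 + biz["review_count"]) / 4, 1.0)
            + 0.2 * biz.get("keyword_score", 0.0)
        )
        scored.append((score, dist, biz))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [dict(biz, distance_miles=round(dist, 2)) for _, dist, biz in scored[:limit]]


candidates = [
    {"business_id": "near", "lat": 37.7755, "lng": -122.4180, "rating": 4.6, "review_count": 2340},
    {"business_id": "far", "lat": 38.5000, "lng": -122.4194, "rating": 4.9, "review_count": 9000},
]
for row in filter_and_rank(candidates, 37.7749, -122.4194, radius_miles=5):
    print(row["business_id"], row["distance_miles"])
```

Because the loop only ever sees the candidate set, its cost is bounded by cell density rather than by the size of the full business table.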

The "open now" filter breaks caching because its answer can change from one minute to the next as businesses open and close. The fix: cache the broader result set without the open-now constraint, then apply it in application code using the hours schedule already embedded in each cached business document. This is fast, cheap, and requires no cache invalidation logic.
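A sketch of the application-layer check, assuming the `hours` shape shown in the business detail response. The function name, and the rule that a close time earlier than the open time means "runs past midnight", are assumptions for illustration. The example passes a fixed UTC-7 offset to stay deterministic; in practice the zone would more likely come from `zoneinfo.ZoneInfo(doc["timezone"])` so that daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone

DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def _parse(hhmm):
    hour, minute = hhmm.split(":")
    return time(int(hour), int(minute))


def is_open_now(hours, now_utc, tz):
    """Evaluate the cached hours schedule in the business's local time."""
    local = now_utc.astimezone(tz)
    now_t = local.time()

    today = hours.get(DAYS[local.weekday()])
    if today:
        opens, closes = _parse(today["open"]), _parse(today["close"])
        if opens <= closes:
            if opens <= now_t < closes:
                return True
        elif now_t >= opens:          # closes after midnight (assumed convention)
            return True

    # Yesterday's schedule may spill past midnight into today.
    yesterday = hours.get(DAYS[(local.weekday() - 1) % 7])
    if yesterday:
        opens, closes = _parse(yesterday["open"]), _parse(yesterday["close"])
        if closes < opens and now_t < closes:
            return True
    return False


pacific = timezone(timedelta(hours=-7))  # fixed offset, for a deterministic example
hours = {"tuesday": {"open": "12:00", "close": "22:00"}}
at_2158 = datetime(2026, 3, 25, 4, 58, tzinfo=timezone.utc)  # 21:58 Tuesday local
at_2200 = datetime(2026, 3, 25, 5, 0, tzinfo=timezone.utc)   # 22:00 Tuesday local
print(is_open_now(hours, at_2158, pacific), is_open_now(hours, at_2200, pacific))
```

The search path would run this over the cached documents for the candidate IDs and drop the closed ones before returning the page.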

07

Key Design Decisions & Tradeoffs

Option A — Chosen
Geohashing

Converts 2D coordinates into a prefixable string. Works with any standard B-tree index. Simple to implement, trivially cacheable. The 9-neighbor query handles boundary edge cases with a minor overhead of 9 indexed lookups instead of 1.

✓ Simple · cacheable · battle-tested
Option B
PostGIS / R-Tree Spatial Index

Native 2D spatial indexing in PostgreSQL. Handles true radius queries and arbitrary polygon areas with no boundary workaround needed. Tradeoff: operationally heavier, harder to cache against, exposes more query complexity than Yelp's use case requires.

~ Only if you need polygon-area search
Option A — Chosen
Cache on Geohash Cell

Cache key includes the geohash prefix, not raw lat/lng. Every user within the same geohash cell shares a cache entry. Hit rate goes from ~0% on raw coordinates to something meaningful. "Open now" is filtered in the application layer after retrieval.

✓ High hit rate · low DB pressure
Option B
Always Query Live

Guaranteed freshness, zero cache infrastructure. At 8,700 peak RPS hitting PostgreSQL directly, you need significantly more DB capacity and replicas. Fine at small scale — painful once you hit millions of DAU and the same 200GB is queried repeatedly.

~ Fine below 1M DAU
Option A — Chosen
Incremental Rating Aggregation

Store total_rating_sum and review_count on the business row. On each review write, atomically increment both. O(1) update regardless of how many reviews exist. No batch job, no recomputation, no meaningful lag.

✓ Start here — simple and correct
Option B
Flink Streaming Aggregation

Windowed aggregations, weighted recency scoring, and fraud detection built into the pipeline. Powerful for complex models. Tradeoff: you now operate Kafka + Flink + consumer logic — significant complexity for a workload that is only 58 writes/second.

~ Add for weighted or fraud-aware ratings
Option A — Chosen
PostgreSQL for Business Data

ACID guarantees, geohash index support, and complex multi-column filter queries in a single statement. At 200GB of business data, this fits comfortably on a few well-provisioned nodes with read replicas absorbing the read load.

✓ SQL required for multi-filter queries
Option B
Cassandra / DynamoDB

Massive write throughput and linear horizontal scaling. But: no native geospatial support, no multi-column filter queries, no joins. The write volume — 58 RPS — doesn't justify any of this complexity. Let the actual numbers drive the decision.

✗ Wrong fit for this workload
08

What Can Go Wrong

🔥
Cache Stampede on Viral Business

A restaurant gets featured on a popular food show. Millions of requests arrive for the same business page simultaneously. The Redis entry expires. Every request misses and hits PostgreSQL at once before the cache repopulates. Latency spikes, potentially cascading.

→ Fix: Mutex lock on cache miss + TTL jitter (±10%) to stagger expirations across the fleet
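The fix can be sketched with an in-process stand-in for Redis. The class below is hypothetical and only illustrates the two ideas: a per-key lock so that one caller reloads while the rest wait, and a TTL drawn from ±10% around the base value. Across a fleet of servers the lock would need to be distributed, for example a Redis `SET` with the `NX` option and an expiry.

```python
import random
import threading
import time


class SingleFlightCache:
    """In-memory cache where only one caller per key runs the loader on a miss."""

    def __init__(self, base_ttl=3600.0, jitter=0.10):
        self.base_ttl = base_ttl
        self.jitter = jitter
        self._data = {}                       # key -> (value, expires_at)
        self._locks = {}                      # key -> lock guarding the reload
        self._locks_guard = threading.Lock()

    def _ttl(self):
        # Spread expirations so hot keys do not all expire at the same instant.
        return self.base_ttl * random.uniform(1 - self.jitter, 1 + self.jitter)

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._locks_guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refilled while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = loader(key)               # the single trip to the database
            self._data[key] = (value, time.monotonic() + self._ttl())
            return value


db_calls = []

def load_business(key):
    db_calls.append(key)
    time.sleep(0.05)                          # simulate a slow PostgreSQL read
    return {"business_id": key}

cache = SingleFlightCache()
workers = [threading.Thread(target=cache.get, args=("business:abc123", load_business))
           for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print("database reads:", len(db_calls))
```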
📍
Geohash Cell Hotspot

Times Square, the Las Vegas Strip — extremely dense urban areas where thousands of businesses share the same geohash cell. Your 9-cell query returns 50,000 businesses instead of 500. The in-memory filter step that was cheap at normal density becomes expensive.

→ Fix: Adaptive precision — 7-char geohash (smaller cells) in dense areas, 5-char in sparse rural areas
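To reason about which precision to use, it helps to compute the cell size from the number of bits. The sketch below derives approximate cell dimensions at the equator (cells get narrower east to west at higher latitudes) and steps to smaller cells while a query would return too many rows. The `pick_precision` helper, its `candidate_count` callback and the 2,000-row threshold are assumptions for illustration.

```python
KM_PER_DEGREE = 111.32  # approximate length of one degree at the equator


def cell_size_km(precision):
    """Approximate (width, height) of a geohash cell at the equator, in km."""
    total_bits = 5 * precision
    lng_bits = (total_bits + 1) // 2   # longitude gets the extra bit when odd
    lat_bits = total_bits // 2
    width = 360.0 / (1 << lng_bits) * KM_PER_DEGREE
    height = 180.0 / (1 << lat_bits) * KM_PER_DEGREE
    return width, height


def pick_precision(candidate_count, max_candidates=2000, coarse=5, fine=7):
    """Move to smaller cells while the 9-cell query would return too many rows.

    `candidate_count(precision)` is a hypothetical callback returning the
    estimated number of businesses a 9-cell query would hit at that precision.
    """
    precision = coarse
    while precision < fine and candidate_count(precision) > max_candidates:
        precision += 1
    return precision


for p in (5, 6, 7):
    w, h = cell_size_km(p)
    print(f"{p}-char cell: {w:.2f} km x {h:.2f} km")

dense_area = {5: 50_000, 6: 6_000, 7: 400}     # made-up counts for illustration
print(pick_precision(dense_area.get))
```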
Review Written, Aggregation Lost

User submits review → written to reviews table → Kafka event published → rating aggregator crashes before processing. Now the review exists but the business rating is permanently stale. A naive restart could double-count the review.

→ Fix: Commit Kafka offset only after successful DB write + idempotent writes using review_id deduplication table
😨
"Open Now" Serving Closed Businesses

A business closes at 9 PM. Cached search results from 8:58 PM still include it as open until the 5-minute TTL expires. Users navigate to a closed restaurant — broken trust, especially for late-night searches when this filter matters most.

→ Fix: Cache without open_now constraint, filter in application layer using hours schedule in the cached business document
📷
Pre-Signed URL Abuse

You issue a 5-minute pre-signed S3 URL. Nothing stops a client from uploading a malicious file type — an executable disguised as a JPEG — or sharing the URL externally before it expires.

→ Fix: Short expiry (5 min) + server-side validation after upload (MIME check, virus scan, resize) + only serve the processed version via CDN, never the raw upload
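The MIME check in that fix should inspect the file's leading bytes rather than trust the filename or the declared content type. Below is a minimal sketch covering JPEG and PNG signatures only; the function names, the 10MB limit and the reason codes are assumptions, and a real pipeline would add the virus scan and resize steps mentioned above.

```python
# Leading-byte signatures for the two formats this sketch accepts.
MAGIC_BYTES = {
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
}


def sniff_image_type(head):
    """Return the MIME type implied by the leading bytes, or None."""
    for mime, signatures in MAGIC_BYTES.items():
        if any(head.startswith(sig) for sig in signatures):
            return mime
    return None


def validate_upload(head, declared_type, size_bytes, max_bytes=10 * 1024 * 1024):
    """Return (ok, detail). Rejects oversized, unknown or mislabelled files."""
    if size_bytes > max_bytes:
        return False, "too_large"
    detected = sniff_image_type(head)
    if detected is None:
        return False, "unsupported_type"
    if detected != declared_type:
        return False, "type_mismatch"
    return True, detected


# An executable renamed to .jpg still starts with the "MZ" header.
print(validate_upload(b"MZ\x90\x00\x03", "image/jpeg", 48_000))
print(validate_upload(b"\xff\xd8\xff\xe0\x00\x10JFIF", "image/jpeg", 480_000))
```

Only objects that pass this check would be processed and published to the CDN path; everything else stays in the raw upload bucket or is deleted.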
09

Interview Tips

01

Lead with the geospatial problem immediately. Most candidates say "index on lat/lng." Get ahead of this yourself: say "a naive lat/lng index fails because B-trees are one-dimensional — we need geohashing to collapse 2D space into a prefixable 1D key." Interviewers are waiting for you to recognise this. Don't wait to be prompted.

02

Know the geohash precision table. Interviewers love asking "what precision would you use?" A 6-character geohash covers roughly 1.2 km × 0.6 km, about a neighbourhood, and is a sensible default for most proximity search problems. Use 5-char (about 4.9 km × 4.9 km) for wider area searches and 7-char (about 150 m × 150 m, roughly a city block) for dense urban environments.

03

The "open now" question is almost guaranteed. Have the answer ready before they ask: cache without the constraint, filter in application layer using the hours schedule embedded in the cached business document. "Just use a short TTL" is the wrong answer — it still serves stale results at closing time boundaries.

04

Separate retrieval from ranking explicitly. State this out loud: the geohash query gets you a candidate set of roughly 500–2,000 businesses in under 10ms. Ranking runs afterwards on that small set and is cheap. Two distinct phases, two distinct optimisations. Most candidates conflate these and confuse themselves.

05

Know the pre-signed URL pattern cold. If asked about photo uploads: "We never route photo bytes through our API servers. We issue a pre-signed S3 URL, the client uploads directly to S3, we validate server-side after upload." This shows you think about bandwidth and infrastructure costs, not just correctness.

06

Don't over-engineer the write path. Reviews write at 58 RPS. Candidates often reach for Cassandra for "user-generated content at scale." The right answer is PostgreSQL — 58 RPS is trivial for it. Let the actual numbers drive the decision, not the intuition that reviews always equals high write volume.

11

How the Design Evolves

Phase 1 — 0 to 10K users
Monolith + PostgreSQL with PostGIS

Single server, single database with PostGIS for geospatial queries. No geohashing needed yet — PostGIS handles the load cleanly. No cache, no queue, no CDN. Ship the product fast and validate it works.

Phase 2 — 10K to 1M users
Add Redis + Geohash Index + CDN

Switch to geohash-based indexing for cacheability. Add Redis in front of search and business reads. Move photos to S3 immediately — this is non-negotiable at any meaningful scale. Add a read replica. Most Yelp competitors live here their whole lives.

Phase 3 — 1M to 50M users
Microservices + Kafka + Horizontal Scale

Split search, business, and review into separate stateless services. Introduce Kafka for the review write path — enables the incremental aggregator pattern and opens the door to downstream analytics consumers. Shard the Business DB by geohash region to distribute load geographically.

Phase 4 — 50M+ users
Multi-Region + Elasticsearch + Flink

Multi-region active-active for latency. Add Elasticsearch as a dedicated search index — richer keyword matching, faceted filtering, BM25 relevance scoring that PostgreSQL can't match. Flink for real-time weighted ratings and review fraud detection at scale. Global CDN footprint for photo delivery.
