System Design — 016

Ticketmaster / StubHub

Design an online ticket booking platform that handles extreme concurrency spikes — millions of users competing for thousands of seats — without ever selling the same seat twice.

Concurrency · Distributed Locks · Inventory Management · Virtual Queue · Flash Sale
01

Problem Statement

Design a system like Ticketmaster or StubHub that allows users to browse events, view interactive seat maps, select and temporarily hold seats, and complete purchases. The system must handle extreme demand spikes — when a popular artist's tickets go on sale, millions of users may compete for tens of thousands of seats within seconds.

Unlike most e-commerce systems where inventory is abundant, ticket booking has a unique constraint: every seat is a unique, non-fungible item. Seat A1 in Row 3 is different from A2. This means the system can never oversell — selling the same seat to two people is a catastrophic failure, not just a data inconsistency.

Core question: How do you let thousands of people compete for a limited number of seats without overselling, while keeping the system responsive and fair?

The Two Fundamental Tensions

Availability vs. Correctness

You want the site to stay up under massive load (availability), but you absolutely cannot double-sell a seat (correctness). This is one of the rare systems where you lean toward consistency over availability — a brief "try again" is acceptable, selling two people the same seat is not.

UX vs. Inventory Accuracy

When a user browses the seat map, should they see real-time availability? If yes, you're hammering your inventory service. If no, they'll try to buy seats that are already gone. The solution: eventually consistent reads with strongly consistent writes.

02

Requirements

Functional Requirements

  • Browse & search events — by artist, venue, date, location, genre
  • View seat map — interactive map showing available seats with pricing tiers
  • Select & temporarily hold seats — when a user picks seats, hold them for ~8 minutes so nobody else can grab them
  • Purchase tickets — complete payment and issue a digital ticket with QR code
  • Cancel / refund — release seats back to inventory on cancellation
  • Virtual waiting room — for high-demand events, manage a fair queue before the sale starts

Non-Functional Requirements

  • Strong consistency on inventory — never oversell a single seat
  • High availability for reads — seat maps, event browsing should always work
  • Low latency under extreme concurrency — checkout flow must work when 100K+ users hit it simultaneously
  • Fairness during hot sales (randomized order for pre-sale arrivals, first-come-first-served afterwards), with no gaming the queue
  • Idempotent payments — never double-charge a customer

Out of scope: Secondary market (resale), dynamic pricing, recommendation engine, user authentication (assume existing OAuth/JWT).

03

Scale Estimation

We derive numbers from assumptions — the numbers drive the architecture, not the other way around.

  • Registered users: 500M
  • Events on sale: ~550/day
  • Total tickets available: 11M/day
  • Ticket data storage: ~2 TB/yr

The Spike That Drives the Design

Major On-Sale Event (Taylor Swift Scale)

50,000 seats, 5–10M users competing. 90% of seats sell in the first 2 minutes. That's ~375 write TPS for purchases — manageable. But the reads are the killer: 500K+ requests/second as millions refresh the seat map.

Derivation

200K events/year × 20K avg seats = 4B tickets/year. At ~500 bytes per ticket record: ~2 TB/year of ticket data. Event metadata, user profiles, and payment records add 5–10×, but still manageable. The read:write ratio during hot sales is ~1000:1 — this extreme ratio shapes the entire architecture. Reads must be cached aggressively; writes must be serialized per seat.
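
The arithmetic above can be checked in a few lines. This is only a sketch: every input is a figure quoted in this section, and the variable names are illustrative.

```python
# Back-of-envelope check; every input below is a figure quoted in this section.
EVENTS_PER_YEAR = 200_000
AVG_SEATS_PER_EVENT = 20_000
BYTES_PER_TICKET = 500
HOT_SALE_SEATS = 50_000
HOT_SALE_READ_RPS = 500_000

tickets_per_year = EVENTS_PER_YEAR * AVG_SEATS_PER_EVENT          # 4 billion
storage_tb_per_year = tickets_per_year * BYTES_PER_TICKET / 1e12  # decimal terabytes
events_per_day = EVENTS_PER_YEAR / 365
tickets_per_day = tickets_per_year / 365

# 90% of the hot-sale seats sell within the first 2 minutes (120 seconds).
seats_sold_in_burst = HOT_SALE_SEATS * 90 // 100
write_tps = seats_sold_in_burst / 120
read_write_ratio = HOT_SALE_READ_RPS / write_tps

if __name__ == "__main__":
    print(f"tickets/year:   {tickets_per_year:,}")
    print(f"storage/year:   {storage_tb_per_year:.1f} TB")
    print(f"events/day:     {events_per_day:.0f}")
    print(f"tickets/day:    {tickets_per_day / 1e6:.0f}M")
    print(f"hot-sale write: {write_tps:.0f} TPS")
    print(f"read:write:     {read_write_ratio:.0f}:1")
```

The computed read:write ratio comes out somewhat above 1000:1, which is the same order of magnitude as the rounded figure quoted above.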

Parameter | Value | Drives...
Entrants (hot sale) | 5–10M | Queue capacity, polling load
Seats per event | 50K | Total admissions needed
Avg checkout time | ~3 min | Batch admission rate
Hold timeout | 8 min | Recapture window, queue drain speed
Inventory svc capacity | 10K concurrent | Batch size per admission wave
Status poll interval | ~5 sec | Queue status endpoint QPS (~1M)
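
Two of the derived figures in the table can be sketched as simple formulas. The polling load follows directly from the table. The admission rate is an assumption added here, not stated in the source: it applies Little's law (throughput = concurrency / time in system) under steady state. Both function names are hypothetical.

```python
def poll_qps(entrants: int, poll_interval_s: float) -> float:
    """Queue-status requests per second if every entrant polls once per interval."""
    return entrants / poll_interval_s


def steady_state_admissions_per_s(concurrent_capacity: int, avg_checkout_s: float) -> float:
    """Assumed steady-state admission rate via Little's law.

    If the inventory service can carry `concurrent_capacity` checkouts at once
    and each lasts `avg_checkout_s` on average, slots free up at this rate.
    """
    return concurrent_capacity / avg_checkout_s


if __name__ == "__main__":
    print(poll_qps(5_000_000, 5))                    # lower bound of the 5-10M range
    print(steady_state_admissions_per_s(10_000, 180))
```

With 5M entrants polling every 5 seconds the status endpoint sees about 1M requests per second, matching the table. Under the steady-state assumption, 10K concurrent slots and a 3-minute checkout give roughly 56 admissions per second.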
04

API Design

Search Events
GET /api/events?query=taylor+swift&location=dubai&date_from=2025-06-01
Authorization: Bearer {jwt}

Response 200:
{
  "events": [
    {
      "event_id": "evt_abc123",
      "name": "Taylor Swift — Eras Tour",
      "venue": "Dubai Arena",
      "date": "2025-09-15T20:00:00Z",
      "seats_available": 32847,
      "price_range": { "min": 150, "max": 950, "currency": "USD" },
      "queue_enabled": true,
      "sale_starts": "2025-06-01T10:00:00Z"
    }
  ],
  "total": 1,
  "cursor": "..."
}

Get Seat Map Availability
GET /api/events/{event_id}/availability
Response 200:
{
  "event_id": "evt_abc123",
  "total_seats": 50000,
  "available": 32847,
  "sections": {
    "SEC-A": { "available": 120, "tier": "platinum", "min_price": 950 },
    "SEC-B": { "available": 380, "tier": "gold", "min_price": 450 },
    "SEC-C": { "available": 1200, "tier": "silver", "min_price": 250 }
  },
  "cached_at": "2025-06-01T10:00:02Z",
  "ttl_seconds": 3
}
// NOTE: This is eventually consistent (2-5s stale). OK for display.

Join Waiting Room
POST /api/queue/join
Body: { "event_id": "evt_abc123" }
Response 200: { "queue_token": "qt_xyz789", "status": "waiting" }

GET /api/queue/status?token=qt_xyz789
Response 200:
{
  "position": 847293,
  "now_serving": 450000,
  "estimated_wait": "12 minutes",
  "status": "waiting"   // waiting | admitted | event_sold_out
}

Hold Seats (Critical Path)
POST /api/inventory/hold
Headers: X-Admission-Token: {queue_admission_token}
Body: {
  "event_id": "evt_abc123",
  "seats": ["SEC-A_ROW-3_SEAT-15", "SEC-A_ROW-3_SEAT-16"],
  "hold_duration_seconds": 480
}
Response 200:
{
  "hold_id": "hold_abc",
  "seats": ["SEC-A_ROW-3_SEAT-15", "SEC-A_ROW-3_SEAT-16"],
  "expires_at": "2025-06-01T10:08:00Z",
  "total_price": 1900.00
}
Response 409: { "error": "seat_taken", "unavailable": ["SEC-A_ROW-3_SEAT-15"] }

Purchase Tickets
POST /api/payments/charge
Headers: Idempotency-Key: {user_id}:{event_id}:{hold_id}
Body: {
  "hold_id": "hold_abc",
  "payment_method_id": "pm_stripe_xyz",
  "total": 1900.00
}
Response 200:
{
  "order_id": "ord_def456",
  "tickets": [
    { "ticket_id": "tkt_001", "seat": "SEC-A_ROW-3_SEAT-15", "qr_code": "..." },
    { "ticket_id": "tkt_002", "seat": "SEC-A_ROW-3_SEAT-16", "qr_code": "..." }
  ],
  "status": "confirmed"
}
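
The Idempotency-Key header on the charge request is what prevents a double charge when a client retries. A minimal in-memory sketch of the idea follows. `IdempotentCharger` and `charge_fn` are hypothetical names, and a production version would need an atomic insert (for example a unique constraint on the key) rather than a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentCharger:
    # Hypothetical gateway call: (payment_method_id, total) -> order_id.
    charge_fn: Callable[[str, float], str]
    _results: Dict[str, str] = field(default_factory=dict)

    def charge(self, idempotency_key: str, payment_method_id: str, total: float) -> str:
        """Charge at most once per key; a retry returns the stored result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        order_id = self.charge_fn(payment_method_id, total)
        self._results[idempotency_key] = order_id
        return order_id


if __name__ == "__main__":
    charger = IdempotentCharger(lambda pm, total: "ord_def456")
    key = "user_1:evt_abc123:hold_abc"  # {user_id}:{event_id}:{hold_id}
    print(charger.charge(key, "pm_stripe_xyz", 1900.00))
    print(charger.charge(key, "pm_stripe_xyz", 1900.00))  # retry, no second charge
```

Deriving the key from the hold means that a retry of the same checkout maps to the same charge, while a new hold produces a new key.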
05

High-Level Architecture

Every component exists to serve one step of the user journey: search → view seats → queue → hold → pay → confirm. Services are separated by scaling characteristics — read-heavy event browsing scales independently from write-critical inventory holds.

[Architecture diagram. Components: Client (Mobile / Web), CDN (Static + Seat Maps), Load Balancer (Path Routing), Queue Service (Waiting Room), Event Service (Search / Browse), Inventory Service (Holds + Purchases), Payment Service (Stripe / Adyen), Notification Svc (Email / Push). Data stores: Redis Sorted Set (Queue), Elasticsearch (Event Search), Redis (Event Cache), Redis SET NX (Holds, the fast gate), PostgreSQL (Source of Truth, the authority), Kafka / SQS (Event Bus). The hold and purchase path is marked as the critical path: never oversell.]

Component Responsibilities

Component | Role | Scaling Note
CDN | Serves static assets + seat map base images. 90% of traffic never reaches origin. | Edge-cached globally
Load Balancer | Path-based routing: /api/events/* → Event Svc, /api/inventory/* → Inventory Svc, etc. | Ensures hot-path isolation
Queue Service | Virtual waiting room for hot events. Randomized position assignment at on-sale time. Controlled batch admission. | Redis Sorted Set, stateless workers
Event Service | Read-heavy workhorse. Full-text search (Elasticsearch), event details (Redis cache), availability overlay. | 50+ instances during hot sales
Inventory Service | Seat state machine: AVAILABLE → HELD → SOLD. Two-layer: Redis SET NX (fast gate) + PostgreSQL (authority). | Sharded by event_id
Payment Service | Decoupled, async. Charges via Stripe/Adyen with idempotency keys. Confirms or releases hold on result. | Independent scaling
Notification Service | Consumes Kafka events. Generates QR tickets, sends emails/push. Fire-and-forget from payment's perspective. | Async, retry from queue
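
The Inventory Service's seat state machine can be sketched as an explicit transition table. The forward path AVAILABLE → HELD → SOLD is from the table above. The two edges back to AVAILABLE are inferred from the requirements (hold expiry and cancel/refund) and are an assumption about how the author models them.

```python
from enum import Enum


class SeatState(Enum):
    AVAILABLE = "AVAILABLE"
    HELD = "HELD"
    SOLD = "SOLD"


# Allowed transitions. The edges back to AVAILABLE are assumptions inferred
# from the hold-expiry and cancel/refund requirements.
ALLOWED = {
    SeatState.AVAILABLE: {SeatState.HELD},
    SeatState.HELD: {SeatState.SOLD, SeatState.AVAILABLE},  # purchase, or hold released
    SeatState.SOLD: {SeatState.AVAILABLE},                  # cancellation / refund
}


def transition(current: SeatState, target: SeatState) -> SeatState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Note that AVAILABLE → SOLD is deliberately absent: a seat can only be sold through a hold.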

Seat Map: CDN + Lightweight Data Overlay

A 50,000-seat stadium map is a complex visual. Rendering it from scratch per-request would be prohibitively expensive. Instead, the base seat map (venue layout, section boundaries, seat positions) is a pre-rendered static asset served from CDN — think of it as the empty blueprint.

The availability overlay is a lightweight JSON payload fetched separately (~200KB for 50K seats), cacheable for 2–5 seconds. The browser loads the cached image from CDN and overlays colored dots based on the fresh JSON. This turns a heavy rendering problem into a lightweight data-fetch problem.
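
A short-TTL cache in front of the availability lookup is what keeps the overlay cheap. The sketch below is an in-process illustration only. `AvailabilityCache` and `loader` are hypothetical names, and a real deployment would more likely rely on CDN or Redis caching as described above.

```python
import time
from typing import Any, Callable, Dict, Tuple


class AvailabilityCache:
    """Serve a cached availability overlay until it is older than the TTL."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical call into the inventory read path
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the TTL can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, event_id: str) -> Any:
        now = self._clock()
        entry = self._entries.get(event_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # fresh enough: no load on the inventory service
        value = self._loader(event_id)
        self._entries[event_id] = (now, value)
        return value
```

With a 3-second TTL, the origin computes the overlay at most once every 3 seconds per event and per cache instance, regardless of how many users refresh the map.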

Request Flow — Step Through
[Interactive step-through omitted. Participants, as listed: Client, CDN, Event Svc, Queue Svc, Inventory Svc, Redis (NX), PostgreSQL, Payment Svc, Kafka, Notification.]
06

Deep Dive — Preventing Double-Selling Under Extreme Concurrency

This is THE interesting problem in this design. When 10,000 users click on the same seat within a second, exactly one must win and 9,999 must be told "seat taken" — instantly, with no race conditions.

The Naive Approach (and Why It Fails)

-- Thread 1 and Thread 2 both run this simultaneously
SELECT status FROM seats WHERE seat_id = 'A1' AND event_id = 'E1';
-- Both see: AVAILABLE  ← race condition window

UPDATE seats SET status = 'HELD', user_id = 'U1' WHERE seat_id = 'A1';
-- Thread 1 wins
UPDATE seats SET status = 'HELD', user_id = 'U2' WHERE seat_id = 'A1';
-- Thread 2 ALSO succeeds → DOUBLE SOLD

Two reads happen before either write. Both see "available" and both proceed. This is a classic read-then-write race condition.
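
The race disappears when the check and the write happen in one statement: the UPDATE carries the expected state in its WHERE clause, and the affected-row count tells the caller who won. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it can run anywhere. The table layout and function names are illustrative.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a single available seat (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seats (event_id TEXT, seat_id TEXT, status TEXT, "
        "user_id TEXT, PRIMARY KEY (event_id, seat_id))"
    )
    conn.execute("INSERT INTO seats VALUES ('E1', 'A1', 'AVAILABLE', NULL)")
    conn.commit()
    return conn


def try_hold(conn: sqlite3.Connection, event_id: str, seat_id: str, user_id: str) -> bool:
    """Conditional update: succeeds only if the seat is still AVAILABLE."""
    cur = conn.execute(
        "UPDATE seats SET status = 'HELD', user_id = ? "
        "WHERE event_id = ? AND seat_id = ? AND status = 'AVAILABLE'",
        (user_id, event_id, seat_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row changed means this caller won the seat


if __name__ == "__main__":
    conn = make_demo_db()
    print(try_hold(conn, "E1", "A1", "U1"))
    print(try_hold(conn, "E1", "A1", "U2"))
```

The second caller's UPDATE matches zero rows because the status is no longer AVAILABLE, so it cannot overwrite the first hold.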

The Two-Layer Hold Pattern

The solution uses Redis as a fast gate and PostgreSQL as the authority. Redis rejects 99% of contention without ever touching the database.

sequenceDiagram
    participant U as User
    participant IS as Inventory Service
    participant R as Redis
    participant PG as PostgreSQL
    U->>IS: POST /hold (seat A1)
    IS->>R: SET seat:E1:A1 user_123 NX EX 480
    alt Key already exists
        R-->>IS: nil (FAIL)
        IS-->>U: 409 Seat Taken
    else Key set successfully
        R-->>IS: OK
        IS->>PG: UPDATE seats SET status='HELD' WHERE status='AVAILABLE' AND version=N
        alt rows_affected = 1
            PG-->>IS: 1 row updated
            IS-->>U: 200 Hold Confirmed (8 min)
        else rows_affected = 0
            PG-->>IS: 0 rows
            IS->>R: DEL seat:E1:A1
            IS-->>U: 409 Seat Taken
        end
    end

Why SET NX Instead of Redlock?

SET NX — Simple Claim

Single atomic command. 1 network round-trip. The NX flag means "only set if not exists." The EX 480 gives an 8-minute TTL. We're claiming, not locking — the hold itself is the state.

Redlock — Overkill

Acquires the lock on a majority of several independent Redis masters (five in the usual description), which multiplies the network calls and relies on bounded clock drift. Designed for mutual exclusion (lock → work → unlock), but we don't have a critical section. Known issues with GC pauses and clock drift (see Kleppmann's critique).
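
The claim semantics of SET NX with a TTL can be illustrated without a Redis server. `FakeRedis` below is an in-memory stand-in written for this sketch, not a real client. It models only "set if absent, with expiry", made atomic by a lock in the same way Redis makes it atomic by executing commands one at a time.

```python
import threading
import time
from typing import Callable, Dict, List, Tuple


class FakeRedis:
    """In-memory stand-in for SET key value NX EX ttl (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._lock = threading.Lock()
        self._clock = clock

    def set_nx_ex(self, key: str, value: str, ttl_s: float) -> bool:
        with self._lock:                      # one command at a time
            now = self._clock()
            current = self._data.get(key)
            if current is not None and current[1] > now:
                return False                  # a live claim already exists
            self._data[key] = (value, now + ttl_s)
            return True


def race_for_seat(store: FakeRedis, key: str, user_ids: List[str]) -> List[str]:
    """Let every user attempt the claim concurrently; return those who succeeded."""
    winners: List[str] = []
    winners_lock = threading.Lock()

    def attempt(user_id: str) -> None:
        if store.set_nx_ex(key, user_id, 480):
            with winners_lock:
                winners.append(user_id)

    threads = [threading.Thread(target=attempt, args=(u,)) for u in user_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However the threads interleave, `race_for_seat` returns exactly one winner, and the claim becomes available again only after the TTL has elapsed.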

What if Redis Succeeds but DB Fails?

The Redis SET NX succeeds, but the PostgreSQL UPDATE fails (timeout, crash, disk full). Now Redis thinks the seat is held, but the DB thinks it's available — split-brain state.

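from enum import Enum


class HoldResult(Enum):  # member values are illustrative
    SUCCESS = "success"
    SEAT_TAKEN = "seat_taken"
    RETRY = "retry"


# `redis` and `db` are assumed to be already-configured async clients;
# `db.execute` is assumed to return the number of affected rows.
async def hold_seat(event_id, seat_id, user_id):
    redis_key = f"seat:{event_id}:{seat_id}"

    # Phase 1: fast gate. SET NX succeeds for exactly one caller per key.
    acquired = await redis.set(redis_key, user_id, nx=True, ex=480)
    if not acquired:
        return HoldResult.SEAT_TAKEN

    # Phase 2: authoritative write. The WHERE clause makes the update conditional.
    try:
        rows = await db.execute("""
            UPDATE seats SET status = 'HELD', user_id = %s,
                   held_until = NOW() + INTERVAL '8 minutes',
                   version = version + 1
            WHERE seat_id = %s AND event_id = %s
              AND status = 'AVAILABLE'
        """, [user_id, seat_id, event_id])

        if rows == 0:
            await redis.delete(redis_key)  # DB disagrees: drop the Redis claim
            return HoldResult.SEAT_TAKEN
        return HoldResult.SUCCESS

    except Exception:
        await redis.delete(redis_key)  # Roll back Redis so the seat is not stranded
        return HoldResult.RETRY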

Defense in Depth — 4 Safety Layers

The Hierarchy of Truth

Layer 1 — PostgreSQL is the source of truth (survives restarts, is ACID).
Layer 2 — Redis is the performance optimization (fast filter, may be stale).
Layer 3 — TTL is the self-healing mechanism (bounds duration of any inconsistency to 8 min).
Layer 4 — Reconciliation job is the safety net (scans every 30s, catches anything TTL hasn't fixed).
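
The source does not spell out what the reconciliation job compares, so the following is a sketch under stated assumptions: it takes a snapshot of live Redis hold keys and of DB rows in HELD state, and reports where they disagree. `Drift` and `reconcile` are hypothetical names, and what to do with each category is left to the caller.

```python
from typing import Dict, List, NamedTuple


class Drift(NamedTuple):
    redis_only: List[str]   # Redis says held, DB does not (the split-brain case)
    db_only: List[str]      # DB says HELD, but the Redis key is gone (e.g. TTL expired)
    mismatched: List[str]   # both say held, but for different users


def reconcile(redis_holds: Dict[str, str], db_holds: Dict[str, str]) -> Drift:
    """Compare seat key -> user id maps from Redis and from the database."""
    redis_only = sorted(k for k in redis_holds if k not in db_holds)
    db_only = sorted(k for k in db_holds if k not in redis_holds)
    mismatched = sorted(
        k for k in redis_holds if k in db_holds and redis_holds[k] != db_holds[k]
    )
    return Drift(redis_only, db_only, mismatched)
```

Because PostgreSQL is the source of truth, a plausible policy is to delete `redis_only` keys and to release `db_only` rows whose hold window has passed, but that policy is an assumption rather than something the source states.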

"Best Available" — FOR UPDATE SKIP LOCKED

Many users don't pick specific seats — they request "2 best available in Section B." PostgreSQL's FOR UPDATE SKIP LOCKED is perfect:

SELECT seat_id FROM seats
WHERE event_id = ? AND status = 'AVAILABLE' AND price_tier = ?
ORDER BY row_number ASC, seat_position ASC  -- front rows first, as a simple proxy for "best"
LIMIT ?
FOR UPDATE SKIP LOCKED  -- Skip rows locked by other transactions

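If 10 people request "best available" simultaneously, they each get different seats without blocking each other, provided each request runs the SELECT and the follow-up UPDATE to HELD inside the same transaction. No deadlocks, no waiting.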

General Admission — Atomic Counter

For events without reserved seating, per-seat locking is unnecessary. Instead, use a single Redis DECR:

remaining = DECR event:E1:remaining
if remaining >= 0:
    # Purchase succeeds — write to DB async
else:
    INCR event:E1:remaining  # Roll back
    # Sold out
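
The decrement-then-compensate pattern above can be exercised in-process. `AtomicCounter` below is a lock-protected stand-in for Redis DECR/INCR written for this sketch, not a Redis client, and the simulation simply counts how many of the concurrent buyers succeed.

```python
import threading
from typing import List, Tuple


class AtomicCounter:
    """Lock-protected counter emulating the atomicity of Redis DECR/INCR."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()

    def decr(self) -> int:
        with self._lock:
            self._value -= 1
            return self._value

    def incr(self) -> int:
        with self._lock:
            self._value += 1
            return self._value

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def try_buy_ga(counter: AtomicCounter) -> bool:
    """Decrement first; if the result went negative, undo it and report sold out."""
    if counter.decr() >= 0:
        return True
    counter.incr()  # compensate for the over-decrement
    return False


def simulate(tickets: int, buyers: int) -> Tuple[int, int]:
    """Run `buyers` concurrent attempts; return (tickets sold, counter at the end)."""
    counter = AtomicCounter(tickets)
    results: List[bool] = []
    results_lock = threading.Lock()

    def buyer() -> None:
        ok = try_buy_ga(counter)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=buyer) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), counter.value
```

With more buyers than tickets, exactly `tickets` attempts succeed and the counter settles back at zero once every failed attempt has compensated.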
07

Key Design Decisions & Tradeoffs

1. Consistency Model

✓ Chosen

Strong Writes + Eventual Reads

Seat map reads are cached (2–5s stale). A seat might show as "available" but be held when you click it — user gets a clear "seat taken" error. Writes (holds, purchases) are strongly consistent via atomic operations.

✗ Alternative

Fully Real-Time Reads

Every seat map view queries live inventory. Perfectly accurate but 500K+ QPS hammers the DB. Unsustainable during hot sales and no meaningful UX improvement — seats change hands in milliseconds anyway.

2. Queue Activation

✓ Chosen

Conditional Queue (per-event flag)

Only enable the virtual waiting room when expected demand exceeds 10× capacity. Small events go directly to the seat map. Avoids unnecessary friction for 95% of events.

✗ Alternative

Always-On Queue

Every event gets a queue. Simpler architecture (one code path) but adds latency and frustration for small shows where seats are abundant. Nobody wants to wait in line for a 200-seat comedy club.

3. Hold Duration

✓ Chosen

8-Minute Hold with TTL

Balances user checkout comfort with inventory turnover. Redis TTL auto-expires holds without cleanup logic. Short enough that abandoned carts don't lock seats for long.

✗ Alternative

15-Minute Hold

Better UX — users feel less rushed. But during a hot sale with 50K seats, 15-min holds mean ~83K admissions needed. That's 25+ minutes of queue draining vs. ~15 min with 8-min holds. More inventory locked = more lost sales.

4. Data Store for Inventory

✓ Chosen

PostgreSQL (ACID)

ACID transactions guarantee seat state correctness. Optimistic locking with version column. Sharded by event_id. Mature, battle-tested. FOR UPDATE SKIP LOCKED handles concurrent "best available" beautifully.

✗ Alternative

DynamoDB / NoSQL

Better write throughput and auto-scaling. But conditional writes for complex state machines are harder to reason about. You'd need to build consistency guarantees yourself — the wrong tradeoff when correctness is existential.

5. Queue Position Assignment

✓ Chosen

Random Shuffle at On-Sale Time

Everyone who joins before the sale starts gets a random position. Prevents bots from gaming arrival time. Uses cryptographic randomness with a publicly committed seed for auditability.

✗ Alternative

FIFO (First-Come-First-Served)

Feels "fair" intuitively. But in practice, rewards people with faster internet, refresh-spamming, and bot scripts. Not actually fair — just fast. Late arrivals (after on-sale) still get FIFO as a tail.
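
One way to make a random shuffle auditable is a commit-and-reveal scheme. The sketch below is an assumption about how the "publicly committed seed" could work, not the author's stated mechanism: the operator publishes a hash of a secret seed before the sale, orders entrants by a keyed hash of their ID, and reveals the seed afterwards so anyone can recompute the order. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import List


def new_seed() -> bytes:
    """Generate a secret seed from the OS cryptographic RNG."""
    return secrets.token_bytes(32)


def commit(seed: bytes) -> str:
    """Commitment published before the sale; reveals nothing useful about the seed."""
    return hashlib.sha256(seed).hexdigest()


def assign_positions(seed: bytes, user_ids: List[str]) -> List[str]:
    """Order users by HMAC-SHA256(seed, user_id); arrival order has no influence."""
    def sort_key(user_id: str) -> bytes:
        return hmac.new(seed, user_id.encode("utf-8"), hashlib.sha256).digest()
    return sorted(user_ids, key=sort_key)


def verify(seed: bytes, commitment: str, user_ids: List[str],
           published_order: List[str]) -> bool:
    """After the reveal, check the seed against the commitment and recompute the order."""
    return (hmac.compare_digest(commit(seed), commitment)
            and assign_positions(seed, user_ids) == published_order)
```

Because the position depends only on the seed and the user ID, joining earlier or refreshing faster does not improve a user's place in the pre-sale shuffle.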

08

What Can Go Wrong

🔴 Payment Service Goes Down

Hold is already in place, so the seat is safe. The system retries payment within the hold window. If the hold expires before payment succeeds, the seat is released. User must re-select. Mitigation: hold is the safety net — worst case is lost sale, never a double-sell.

🔴 Redis Goes Down

Fall back to PostgreSQL optimistic locking only. Higher latency (5ms vs. <1ms) but still correct. This is why PostgreSQL is the source of truth, not Redis. Feature-flag the Redis layer so it degrades gracefully.

🔴 Hold Expires During Payment Processing

The scariest edge case. User submits payment at minute 7, hold expires at minute 8, payment completes at minute 8:15 — but the seat was released and re-held by someone else. Solution: the confirm_purchase call does an atomic WHERE check — if the seat is no longer held by this user, the payment is refunded immediately.
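
The atomic WHERE check in the confirm step can be sketched as a conditional update from HELD to SOLD that only matches if the row is still held by the paying user. As before, sqlite3 stands in for PostgreSQL so the example runs as-is. The schema and function names are illustrative, and the refund itself is left to the caller.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """In-memory table with one seat currently held by user U1 (demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seats (event_id TEXT, seat_id TEXT, status TEXT, "
        "user_id TEXT, PRIMARY KEY (event_id, seat_id))"
    )
    conn.execute("INSERT INTO seats VALUES ('E1', 'A1', 'HELD', 'U1')")
    conn.commit()
    return conn


def confirm_purchase(conn: sqlite3.Connection, event_id: str, seat_id: str,
                     user_id: str) -> bool:
    """Mark the seat SOLD only if this user still holds it.

    Returns False when the hold was lost; the caller should then refund.
    """
    cur = conn.execute(
        "UPDATE seats SET status = 'SOLD' "
        "WHERE event_id = ? AND seat_id = ? AND status = 'HELD' AND user_id = ?",
        (event_id, seat_id, user_id),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = make_demo_db()
    print(confirm_purchase(conn, "E1", "A1", "U2"))  # U2 never held the seat
    print(confirm_purchase(conn, "E1", "A1", "U1"))  # U1 still holds it
```

A late payment for a seat that has been released and re-held by someone else matches zero rows, so it can never overwrite the new holder.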

🔴 Bot Attacks

Bots try to grab hundreds of seats using multiple accounts. Defenses: CAPTCHA at queue entry, device fingerprinting, rate limiting on token generation (one per IP per event), random queue position assignment (bots can't gain speed advantage), second CAPTCHA at admission.

🔴 Hot Partition

A single popular event means all requests hit the same DB shard. Redis absorbs most contention (99% of rejections happen at the SET NX layer). The DB only sees successful holds (~50K writes over 15 minutes). Shard by event_id so other events are unaffected.

🔴 Split-Brain: Redis Says Held, DB Says Available

Redis SET NX succeeds, DB write fails. The rollback logic immediately DELs the Redis key. If even the DEL fails, the 8-minute TTL self-heals — the key auto-expires and the seat becomes available again. Background reconciliation job catches stragglers every 30 seconds.

Anti-patterns

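🚫
Optimistic concurrency on the seat row as the only gate

100K people all try to grab seat A12 simultaneously → 99,999 failed writes and retries land on the database.

✓ Better: Claim the seat up front (the Redis SET NX gate plays the role of the pessimistic lock) behind a queue-based waiting room that admits users in controlled batches.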
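🚫
Treating cached seat availability as authoritative

Even a TTL of a few seconds means 10K users can see the same seat as available.

✓ Better: Use the cached overlay for display only and validate every hold on the atomic write path; push SSE/WebSocket updates where fresher maps are worth the cost, and cache static (event, venue) data freely.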
🚫
One monolithic DB transaction from reserve → payment

Transaction holds seat locks for minutes while user types card info.

✓ Better: Two-phase: soft hold (8-minute TTL, as chosen above) then payment; explicit release on timeout.
09

Interview Tips

💡
Lead with the constraint, not the components.
"The core challenge here is preventing double-selling under extreme concurrency. Let me design around that." This immediately shows you understand what makes this problem unique — it's not a generic CRUD app.
Clarify the seating model early.
Ask: "Are we designing for reserved seating (pick your seat) or general admission?" This fundamentally changes the concurrency approach — reserved needs per-seat locking (SET NX), GA needs an atomic counter (DECR).
🎯
The virtual queue is your scaling secret weapon.
Proactively say "we need to protect downstream systems by controlling the admission rate." Interviewers love this — it shows you think about production realities, not just component diagrams.
🧠
Don't forget: the happy path is boring.
The interesting design is in edge cases: hold expiration during payment, Redis/DB inconsistency, queue fairness under bot attacks. Volunteer these — "let me talk about what happens when things go wrong."
🔑
Know the magic words: SET NX, FOR UPDATE SKIP LOCKED, idempotency key.
These three techniques — Redis atomic set-if-not-exists, PostgreSQL row-level skip-locking, and payment idempotency — are the concrete implementation details that turn a hand-wavy design into a credible one.
📊
State the read:write ratio.
"During a hot sale, reads outnumber writes ~1000:1. This means I can serve reads from a cache with 2–5 second staleness, while writes go through an atomic path." This single sentence justifies the entire caching strategy.
11

Evolution

How this design grows from a single-server prototype to a planet-scale ticketing platform.

1

MVP — Single Server, Small Venues

Single PostgreSQL database with optimistic locking (version column). No Redis, no queue. Direct seat selection → payment. Works for venues up to ~5,000 seats where concurrent demand is manageable. Simple, correct, and easy to reason about.

2

Growth — Redis + Virtual Queue

Add Redis as a fast gate for seat holds (SET NX). Introduce the virtual waiting room for events with expected demand > 10× seat count. Add read replicas for seat map queries. Elasticsearch for event search. CDN for static assets and seat map images. Handles events up to ~50,000 seats.

3

Scale — Sharding + Global Reach

Shard PostgreSQL by event_id so hot events don't affect others. Async payment processing with idempotency keys. Multi-region deployment (CDN edge + regional API servers). Kafka event bus for decoupled notification and analytics. Dynamic queue batch sizing based on real-time inventory service load. Handles multiple simultaneous hot sales globally.

4

Platform — Secondary Market + Dynamic Pricing

Add verified resale marketplace (separate service layer on top of primary). Dynamic pricing service that adjusts prices based on demand signals from the queue. Mobile-first ticket delivery with NFC/Apple Wallet. Analytics pipeline for venue operators. Fraud detection ML for bot prevention. Transfer and gifting capabilities.
