Stock Trading Platform — System Design

01

Problem Statement

Design a stock trading platform that allows users to buy and sell stocks, view real-time market data, and manage their portfolio. The system must handle 25M registered users with 5M daily active users and 2M concurrent WebSocket connections during market hours.

This system has two fundamentally different halves: a read-heavy, latency-sensitive market data pipeline and a write-heavy, correctness-critical order execution path. The architecture must treat these as separate planes with different consistency guarantees.

Core question: How do you build a system where the trading path demands CP (consistency over availability) while the market data path demands AP (availability over consistency) — and both share a user identity?

02

Requirements

Functional Requirements

Core Trading

Place orders — market, limit, stop-loss, stop-limit
Cancel/modify pending (unfilled) orders
Support for fractional shares (buy $5 of AAPL instead of 1 full share)
Extended hours trading (pre-market 4am–9:30am, after-hours 4pm–8pm ET)

Portfolio & Account

View portfolio — holdings, real-time P&L, cost basis
Portfolio equity chart over time (the iconic line graph)
Deposit/withdraw funds (ACH bank transfer, 1-3 day settlement)
Buying power calculation — settled cash + unsettled funds (T+1) + instant deposit credit + margin − pending reserves − regulatory holds

Market Data

Real-time price tickers (bid/ask/last/volume)
Historical candlestick data (1min, 5min, 1hr, 1day candles)
Order book depth (Level 2 — top 5 bids/asks)

Compliance & Regulatory

Pattern Day Trader (PDT) detection — FINRA rule: 4+ day trades in 5 business days with account < $25K restricts trading
Tax lot tracking for 1099 generation
KYC/AML — identity verification at signup
Immutable audit log — SEC Rule 17a-4 requires 7+ year retention

Non-Functional Requirements

Consistency & Correctness

Exactly-once order execution — no duplicate fills, ever
Atomic balance updates — cash debit and share credit together or not at all
Read-after-write guarantee on order path
T+1 settlement tracking (since May 2024)

Availability & Latency

Order acknowledgment < 200ms
Market data latency < 500ms exchange-to-screen
99.99% uptime during market hours (~39 min downtime/year max)
Graceful degradation — trading works if market data is down, and vice versa

03

Scale Estimation

All numbers derived from 25M registered users, 5M DAU, 20% active traders. Market hours are 6.5 hrs/day, 252 trading days/year. Traffic is extremely spiky — 25% of daily volume in the first 30 minutes.

3K/sec

Peak Orders/sec

2M

WebSocket Connections

500K/sec

Market Data Ticks In

1.1 TB/yr

Order Storage

1.1 TB/day

Tick Data Raw

3K/sec

Portfolio Reads Peak

40+

WebSocket Servers

~10K

Active Symbols

Derivation

5M DAU: 20% active traders (1M) × 4 orders/day = 4M orders/day. Average: 170/sec across 6.5 hrs. Market-open spike (25% of volume in 30 min) = 555/sec. Burst within open (3-5×) = 1,500-2,800/sec. Design target: 3K/sec with headroom. GameStop-style events can hit 10K/sec.

2M concurrent users × 80% with app open = 1.6M WebSocket connections, rounded to 2M. At 50K connections/server = 40 servers. Market data fan-out after batching: ~50K msg/sec through Redis, fan-out to connections at WebSocket server level.

Key insight: Market data fan-out — not orders — is the actual scalability bottleneck. 2M users × 20 symbols × 5 updates/sec = hundreds of millions of messages/sec outbound. This drives the WebSocket tier sizing.

04

API Design

Four API domains with different consistency requirements. Every mutating endpoint requires X-Idempotency-Key header. All prices as strings (not floats) to avoid rounding errors.

Trading API

Place Order

POST /v1/orders
Headers: Authorization: Bearer <token>, X-Idempotency-Key: <uuid>

{
  "symbol": "AAPL",
  "side": "buy",                    // buy | sell
  "type": "limit",                  // market | limit | stop | stop_limit
  "quantity": 10,                   // OR "notional": "500.00" (fractional)
  "limit_price": "185.50",          // string, not float
  "time_in_force": "day",           // day | gtc | ioc | fok
  "extended_hours": false
}

→ 201 { order_id, status: "pending_new", reserved_amount: "1855.00", etag: "v1" }

Errors: 422 INSUFFICIENT_BUYING_POWER | 422 SYMBOL_HALTED
        403 ACCOUNT_RESTRICTED (PDT) | 409 DUPLICATE_ORDER

Cancel & Modify

DELETE /v1/orders/{order_id}
→ 200 { status: "pending_cancel" }   // NOT "cancelled" — async confirmation

PATCH /v1/orders/{order_id}
Headers: If-Match: "v1"              // optimistic lock via ETag
{ "limit_price": "184.00" }
→ 200 { status: "pending_replace", etag: "v2" }

Market Data — WebSocket

Real-time Streaming (single connection, multiple channels)

WebSocket /v1/market/stream

Client → Server: { "action": "subscribe", "channel": "quotes", "symbols": ["AAPL","TSLA"] }
Server → Client: { "channel": "quotes", "symbol": "AAPL", "bid": "187.42",
                    "ask": "187.44", "last": "187.43", "volume": 48293017 }
Server → Client: { "channel": "orders", "order_id": "ord_abc", "status": "filled",
                    "avg_fill_price": "185.32" }
Server → Client: { "channel": "heartbeat", "ts": "..." }   // every 30s

Portfolio API

Portfolio Snapshot

GET /v1/portfolio → 200
{
  "equity": "47832.50",
  "cash": { "settled": "10800.50", "unsettled": "1500.00", "withdrawable": "10800.50" },
  "buying_power": "24600.00",
  "day_trades_remaining": 3,
  "positions": [
    { "symbol": "AAPL", "quantity": "10.000000", "avg_cost": "185.32",
      "current_price": "187.45", "unrealized_pnl": "21.30" }
  ],
  "as_of": "2025-06-10T14:30:05Z"  // snapshot timestamp
}

Design principle: POST /orders returns acknowledgment, not fill confirmation. Fills arrive asynchronously via WebSocket — because the order goes through validation, risk checks, and exchange routing before execution.

05

High-Level Architecture

Two-plane architecture: Trading Plane (CP) optimized for correctness and exactly-once execution, and Market Data Plane (AP) optimized for throughput and low latency. They connect at the WebSocket Gateway and Portfolio Service.

Infrastructure Sizing

Component	Instances	Specs
API Gateway	4-6	4 CPU, 8GB RAM
Order Service	6-10	4 CPU, 16GB RAM
Execution Engine	4-8	8 CPU, 16GB RAM
Fill Processor	4-6	4 CPU, 8GB RAM
Portfolio Service	8-12	4 CPU, 16GB RAM
WebSocket Gateway	40-60	4 CPU, 32GB RAM
PostgreSQL Primary	1	32 CPU, 128GB RAM, NVMe
Redis Cluster	6 nodes	16GB RAM each
Kafka Cluster	5 brokers	8 CPU, 32GB RAM, SSD

Request Flow — Step Through

Client→API Gateway→Order Service→PostgreSQL→Kafka→Execution Engine→Exchange (FIX)→Fill Processor→Portfolio DB→WebSocket Push

Click Next Step to walk through the request flow.

06

Deep Dives — 10 Rabbit Holes

Each rabbit hole explores a critical subsystem at implementation depth. Click to jump to any section.

01Order Lifecycle & Exactly-Once 02Buying Power State Machine 03Market Data Fan-Out at Scale 04Smart Order Routing & Reg NMS 05Reconciliation & Settlement 06WebSocket at 2M Scale 07Risk Engine & Circuit Breakers 08Portfolio Equity Curves 09Idempotency & Failure Recovery 10Regulatory Audit Trail

01

Order Lifecycle & Exactly-Once Execution

Every order is a finite state machine with defined transitions. Any transition not in the map is rejected. State is derived from an append-only event log, not mutable status columns.

stateDiagram-v2 [*] --> CREATED CREATED --> PENDING_NEW: Publish to Kafka PENDING_NEW --> SUBMITTED: FIX NewOrderSingle SUBMITTED --> PARTIALLY_FILLED: Exchange partial fill SUBMITTED --> FILLED: Exchange full fill SUBMITTED --> REJECTED: Exchange rejects SUBMITTED --> PENDING_CANCEL: User cancels SUBMITTED --> EXPIRED: Market closes (TIF=day) PARTIALLY_FILLED --> PARTIALLY_FILLED: More fills PARTIALLY_FILLED --> FILLED: Final fill PARTIALLY_FILLED --> PENDING_CANCEL: User cancels remainder PENDING_CANCEL --> CANCELLED: Cancel confirmed PENDING_CANCEL --> FILLED: Filled before cancel arrived

Event Sourcing — How State Is Stored

We never do UPDATE orders SET status = 'FILLED'. Instead, every state change is an immutable event appended to the order_events table. Current status is derived from the latest event. This gives FINRA-compliant audit trails for free.

INSERT INTO order_events (order_id, event_type, fill_qty, fill_price, venue, ts)
VALUES ('ord_abc', 'PARTIAL_FILL', 4, 187.30, 'XNYS', NOW());

-- Performance optimization: materialized status column
-- Updated in same transaction as event insert
BEGIN;
  INSERT INTO order_events (...) VALUES (...);
  UPDATE orders SET status = 'FILLED', filled_qty = 10,
    avg_fill_price = 187.42, updated_at = NOW()
  WHERE id = 'ord_abc';
COMMIT;
-- Event log is source of truth; status column is read optimization

The Exactly-Once Problem — Three Boundaries

Boundary 1: Client → Gateway

Client retries POST /orders due to timeout. Gateway checks idempotency key in Redis — if exists, returns cached response. PostgreSQL UNIQUE constraint on orders.idempotency_key is the definitive backstop.

Boundary 2: Order Service → Kafka

Order committed to DB but Kafka publish fails. Transactional Outbox pattern: write order + outbox row in same transaction. Outbox relay publishes to Kafka. Alternative: Debezium CDC reads PostgreSQL WAL and auto-publishes.

Boundary 3: Execution Engine → Exchange

FIX message sent, no response, engine retries. Exchange rejects duplicate ClOrdID. FIX sequence number recovery resends missed messages on reconnect.

Partial Fills

A limit order for 100 shares may fill as 30 + 20 + 50 across different venues. Each partial fill triggers an atomic update: insert event, update position with weighted average cost, release proportional reserve, debit actual fill cost.

-- Weighted average cost update on each fill
new_avg_cost = (old_avg_cost × old_qty + fill_price × fill_qty) / (old_qty + fill_qty)

-- Reserve release per partial fill
pro_rata_reserve = (fill_qty / order_qty) × total_reserved
released = pro_rata_reserve - actual_fill_cost
-- Net: user gets back excess reserve as buying power

02

Buying Power as a Derived State Machine

Buying power looks like a simple number but is a real-time derived state that depends on 6 different data sources: settled cash, unsettled cash (T+1), instant deposit credit, margin allowance, pending order reserves, and regulatory holds.

buying_power = settled_cash
             + unsettled_cash        // from recent sells, T+1 settlement
             + instant_deposit_credit // ACH pending, capped by tier
             + margin_allowance       // if margin account, 2x leverage
             - reserved_for_pending   // buy orders lock funds
             - maintenance_minimum    // margin maintenance requirement
             - regulatory_holds       // PDT restrictions, GFV limits

T+1 Settlement

Sell AAPL on Monday → cash is "unsettled" until Tuesday. Unsettled cash CAN buy stocks but CANNOT be withdrawn. Buying with unsettled funds then selling before settlement = Good Faith Violation. Three GFVs in 12 months restricts account to settled-cash trading for 90 days.

The ACH Bounce Nightmare

Worst-case edge flow

Day 1: User deposits $5,000 via ACH, gets instant credit, buys $4,800 of TSLA. Day 3: ACH bounces. System must reverse $5,000 credit → user has negative buying power. Account frozen, user has 5 days to deposit or positions are force-liquidated.

Event-Driven Materialized View

Instead of recomputing from scratch, maintain a running snapshot updated incrementally on every relevant event (order created, filled, cancelled, ACH status change, settlement, price tick for margin). Kafka Streams consumer with RocksDB state store writes to Redis for the fast-path check. Order Service does pessimistic revalidation in PostgreSQL inside the transaction.

// Fast path (optimistic, no DB lock)
bp = redis.get("bp:" + userId)
if (bp < orderAmount) → reject immediately (no DB hit needed)

// Slow path (inside transaction, authoritative)
BEGIN;
  SELECT buying_power FROM accounts WHERE id = $1 FOR UPDATE;
  -- revalidate against actual data, then proceed
COMMIT;

03

Market Data Fan-Out at Scale

The actual scalability bottleneck. 100K-500K ticks/sec inbound from exchanges, fan out to 2M concurrent subscribers with different watchlists.

Inbound: Exchange Feed Processing

CTA (NYSE-listed) and UTP (NASDAQ-listed) feeds deliver binary market data. The ingestion service parses at < 5μs per message, maintains an in-memory NBBO book per symbol, and only publishes downstream when the NBBO actually changes — cutting downstream load by ~80%.

// NBBO change filter — critical optimization
On each quote update from exchange:
  Update bid/ask from reporting exchange
  Recompute best_bid = max(all_exchange_bids)
  Recompute best_ask = min(all_exchange_asks)
  If NBBO changed → PUBLISH market:{symbol} to Redis
  If NBBO unchanged → skip (saves 80% of publishes)

Fan-Out: Redis Pub/Sub → WebSocket Servers

Each WebSocket server subscribes to Redis channels only for symbols its connected users watch. When AAPL ticks, Redis sends one message to each subscribing server, which then multicasts to all local connections watching AAPL.

✓ Chosen

Redis Pub/Sub

Fire-and-forget. <1ms latency. If subscriber is slow, messages drop — acceptable for ticks. Latest value always wins.

✗ Alternative

Kafka for Market Data

Persistent, guaranteed delivery. But 5-20ms latency, and stale ticks queuing up is worse than dropping them.

WebSocket Server Internal Fan-Out

Hot path: Redis delivers one AAPL message → server looks up HashMap<Symbol, Set<ConnectionId>> → maybe 15,000 connections need it → serialize once → write binary frame to 15,000 TCP buffers. Server-side batching at 50ms windows deduplicates multiple ticks for the same symbol and uses writev() gathering writes to reduce syscalls.

Per-Server Numbers

50K connections, top 10 symbols at 15K subscribers each × 5 ticks/sec = 750K writes/sec. Mid-tier 1000 symbols at 500 subscribers × 2 ticks/sec = 1M writes/sec. Total: ~1.75M write operations/sec per server, ~175 MB/sec outbound. Requires 2 Gbps NIC with headroom.

04

Smart Order Routing & Reg NMS

When a user submits "BUY 100 AAPL MARKET," the order could go to 13 US exchanges, multiple market makers (Citadel, Virtu), or dark pools. The Smart Order Router (SOR) selects the best venue.

Reg NMS — The Legal Constraint

Rule 611 (Order Protection Rule): you cannot execute a trade at a price inferior to the National Best Bid/Offer (NBBO). If NYSE offers AAPL at $187.42 and you route to Citadel who fills at $187.45, you violated Reg NMS. The SOR must compare all venue quotes against the NBBO before routing.

SOR Decision Engine

Scoring per venue (weighted):
  Price improvement:     40% — filling below NBBO saves customer money
  Historical fill rate:  25% — what % of orders at this venue actually fill?
  Speed:                 15% — how fast does the venue respond?
  PFOF rate:             10% — payment for order flow revenue
  Stability:             10% — reject rate penalty

PFOF Economics — How Robinhood Makes Money

Revenue model

Market makers pay $0.002-$0.004/share for retail order flow. 4M orders/day × 20 shares avg × $0.003 = $240K/day from equities. Options PFOF at $0.40-$0.60/contract is 2-3× more profitable. Total PFOF revenue: ~$150-200M/year. Market makers profit because retail flow is "uninformed" — less likely to be followed by adverse price moves.

05

Reconciliation & Settlement

The safety net that catches every edge case the real-time systems miss. Runs daily after market close.

End-of-Day Reconciliation (5:00 PM ET)

Step 1: Pull execution reports from all brokers (FIX drop copy / API)
Step 2: Pull internal order_events for the day
Step 3: Match by ClOrdID
Step 4: Compare each pair — qty, price, venue, status
Step 5: Classify discrepancies:
  Cat A — Missing internally (most critical): reconstruct from exchange data
  Cat B — Missing at exchange (suspicious): quarantine, investigate
  Cat C — Price/qty mismatch: auto-correct if < $0.01, else manual review
  Cat D — Status mismatch: update our status, release held reserves
Step 6: Report — typically 99.999% perfect match rate

Corporate Actions

Stock splits, reverse splits, dividends, mergers, and spinoffs change positions overnight without user action. A corporate actions processor runs nightly: cancels pending orders for affected symbols, adjusts position quantities and cost basis, creates audit events, and notifies users. Stock splits also require retroactive adjustment of historical price data and portfolio snapshots.

06

WebSocket Connection Management at 2M Scale

Memory Per Connection

TCP socket buffers:     ~32 KB (tuned down from 128KB default)
TLS session state:      ~4 KB
WebSocket frame buffer: ~8 KB
Application state:      ~3 KB (metadata + subscriptions + ring buffer)
─────────────────────────────────
Total: ~47 KB per connection (tuned) to ~160 KB (default)
Per server (50K conns): 8-16 GB RAM for connections alone

L4 vs L7 Load Balancing

✓ Chosen

L4 (NLB / HAProxy TCP)

Forwards TCP packets directly. Single connection client→server. LB tracks only {src:port → dst:port} mapping (~100 bytes each). 2M × 100 bytes = 200 MB — trivial.

✗ Alternative

L7 (ALB / Nginx)

Terminates WebSocket, creates new connection to backend. Doubles connection count (LB holds 2M + servers hold 2M). 320 GB RAM on LB alone. LB becomes throughput bottleneck.

Reconnection Storm Mitigation

When WS-17 crashes, 50K clients reconnect simultaneously. Without mitigation: thundering herd. Solution: client jitter (random 0-10s delay), server-side rate limiting (max 500 new connections/sec), LB health checks (overloaded servers marked unhealthy). Graceful deploys send {"action":"reconnect","delay_ms":random(0,30000)} before draining.

Order Event Routing to Correct Server

Connection registry pattern: on connect, HSET ws:connections {user_id} {server_id} in Redis. Order event consumer does one Redis lookup per fill to route to the correct WebSocket server via internal channel.

07

Risk Engine & Circuit Breakers

Pre-Trade Risk Pipeline (7 checks, < 20ms)

#	Check	Failure
1	Buying power sufficient	INSUFFICIENT_BUYING_POWER
2	Position limits ($500K max per symbol)	POSITION_LIMIT_EXCEEDED
3	Order rate limits (60/min, $1M/min notional)	RATE_LIMIT_EXCEEDED
4	Price reasonability (limit within 10% of market)	PRICE_UNREASONABLE
5	Symbol tradability (not halted, session valid)	SYMBOL_HALTED
6	PDT rule (day trades remaining > 0)	PDT_RESTRICTION
7	Good Faith Violation prevention	GFV_WARNING

Real-Time Margin Monitoring

Continuously compute margin_ratio = equity / maintenance_requirement for every margin account. Thresholds: >2.0 healthy, 1.5-2.0 warning (push notification), 1.0-1.5 margin call (restrict buying, 2-5 day grace), <1.0 forced liquidation (auto-sell most liquid positions first), <0.75 emergency liquidation (sell everything). Watch list optimization: only accounts near thresholds get per-tick recalculation; healthy accounts batch-checked every 30 seconds.

Exchange Circuit Breakers (LULD)

Stock moves >5% in 5 min → 5-min halt. Market-wide S&P 500 drops: -7% = 15-min halt, -13% = 15-min halt, -20% = market closes for the day. System must detect halt messages, stop accepting orders for halted symbols, and handle the flood when trading resumes.

08

Portfolio Equity Curve Generation

The portfolio value at any point = cash_at_t + Σ(shares_held_at_t × price_at_t). For 25M users across years of history, this is a massive data problem.

Intraday (1-Day View)

Snapshot + delta + live prices. Start-of-day snapshot (precomputed at market open) + position changes from trades today + current prices = equity curve. 78 data points (5-min intervals across 6.5 hrs). Computed on-demand per request in ~5ms — no need to precompute for all users.

Historical (1-Week to All-Time)

Nightly batch job at 1:00 AM: for each user with positions, compute close-of-day equity = cash + Σ(qty × closing_price). Store in portfolio_snapshots table (user_id, date, equity). 25M users × 252 days × 100 bytes = 630 GB/year. Partition by date, archive >2 years to Parquet on S3.

Corporate Action Adjustments

Stock splits make historical snapshots wrong. When AAPL splits 4:1, historical price data is retroactively adjusted (divided by 4). Affected portfolio snapshots must be recomputed using split-adjusted prices — a batch job triggered by the corporate actions processor.

09

Idempotency & Failure Recovery Patterns

Six-Layer Idempotency Map

Layer	Mechanism	Protects Against
1. Mobile Client	UUID idempotency key, local storage	User taps "buy" twice
2. API Gateway	Redis SET NX + cached response	Network retry duplicates
3. Order Service	PostgreSQL UNIQUE constraint	Redis miss, concurrent requests
4. Kafka Producer	enable.idempotence=true, PID+seq	Producer retry sends same message twice
5. Execution Engine	RocksDB processed_order_ids set	Consumer restart reprocesses message
6. Exchange (FIX)	Unique ClOrdID per session/day	FIX session retry sends duplicate order

Transactional Outbox Pattern

BEGIN;
  INSERT INTO orders (...) VALUES (...);
  INSERT INTO order_events (...) VALUES (...);
  INSERT INTO outbox (topic, key, payload) VALUES ('orders.submitted', user_id, order_json);
COMMIT;
-- Outbox relay polls, publishes to Kafka, marks published=true
-- Alternative: Debezium CDC reads WAL, auto-publishes (no outbox table)

Poison Pill Handling

Unprocessable messages in Kafka (malformed, references deleted user) → Dead Letter Queue. Transient errors (DB timeout): don't commit offset, retry with backoff. Non-retriable errors: send to DLQ, commit offset to unblock partition, alert ops team.

Exactly-once proof: No single point can cause a duplicate trade. No single point can cause a lost trade (that was accepted). Each layer's dedup mechanism is independent, so even if one fails, the next catches it. Belt, suspenders, and a parachute.

10

Regulatory Audit Trail & Compliance

SEC Rule 17a-4

Every broker-dealer must preserve all order records, execution records, customer communications, and account records for minimum 6 years (first 2 easily accessible). Format: WORM (Write Once, Read Many) — cannot be altered or deleted. Must be producible within 24 hours of SEC/FINRA request.

Compliance Event Bus

Every service → Kafka audit.* topics → Consumers:
  ├─ S3 (WORM, Object Lock COMPLIANCE mode, 7-year retention, Parquet)
  ├─ Elasticsearch (2-year hot store, searchable within seconds)
  ├─ Compliance Dashboard (real-time monitoring)
  └─ Third-Party Archive (FINRA-approved vendor)

Daily volume: 4M orders × 5 events × 1KB = 20 GB/day raw → ~4 GB compressed

Real-Time Surveillance

Apache Flink streaming application consumes from audit.orders, maintains sliding windows per user per symbol, and detects:

Wash Trading

Buy then sell same stock at same price within minutes to create false volume. Flag if same symbol ±0.5% price within 5 minutes.

Spoofing / Layering

Place large limit orders to move price, cancel before fill. Flag if >1000 shares cancelled within 30 seconds, 3+ times/day.

Front-Running

Employee trades same symbol within 15 minutes before large customer order. All employee accounts tagged for monitoring.

Pump and Dump

Social media spike + buy activity by small group + subsequent sell after price increase. Requires external social media feed integration.

Required Filings

Rule 606 (quarterly): where orders were routed, PFOF received. CAT (daily): every order lifecycle event to FINRA. SAR (as needed): suspicious activity within 30 days of detection to FinCEN.

07

Simpler, unidirectional. But changing subscriptions requires new connection. Can't multiplex order updates on same stream without workarounds.

08

What Can Go Wrong

Double Execution

Network partition causes execution engine to submit same order twice. Mitigation: Exchange-side ClOrdID is idempotent — exchange rejects duplicates. Execution engine checks order state before submitting.

Fill Message Lost

Exchange fills order but FIX message never arrives. Mitigation: End-of-day reconciliation compares internal state against exchange reports. FIX sequence number recovery resends missed messages on reconnect.

Buying Power Race Condition

Two concurrent orders drain same balance. Mitigation: SELECT ... FOR UPDATE serializes concurrent orders for same user. Row-level lock on account.

WebSocket Tier Crash

Users lose real-time prices. Mitigation: Client auto-reconnects with jitter. REST fallback for latest prices. Servers are stateless — subscriptions re-established on reconnect.

Kafka Consumer Lag

Orders accepted but execution delayed. Mitigation: Monitor consumer lag. Auto-scale execution engine consumers. Alert if lag exceeds 5-second SLA.

ACH Bounce After Spending

User spends instant credit, then ACH bounces. Mitigation: Freeze account, 5-day grace period, force-liquidate if unresolved. Revoke instant deposit privilege for repeat offenders.

Database Failover Mid-Transaction

PostgreSQL primary fails during order write. Mitigation: Synchronous replication to standby. Transaction either commits on both or rolls back. Client retries with same idempotency key.

09

Interview Tips

💡

Lead with the two-plane split. Say early: "This system has two fundamentally different data paths — orders need CP, market data needs AP. I'll design them separately." This immediately signals architectural maturity.

⚡

Show the money math. Walk through one concrete order: "User has $5,000, buys 10 AAPL at ~$187. Reserve $1,870. Fill at $187.42. Debit $1,874.20, release excess reserve." Makes the consistency requirement tangible.

🎯

The idempotency key is your anchor. When discussing failure modes, always come back to: "The idempotency key ensures that no matter where the system fails and retries, the user never gets a duplicate trade."

📋

Mention regulatory requirements. "FINRA requires full audit trails" and "SEC Reg NMS governs best execution" — one sentence each shows domain awareness that separates you from generic answers.

🔑

Don't design a matching engine unless asked. Robinhood routes to exchanges; it doesn't match orders. If asked to design an exchange, that's a different (harder) problem. Clarify scope early.

🏗️

Buying power complexity. One powerful sentence: "Buying power looks simple but it's a real-time derived state depending on settlement cycles, margin, pending reserves, and regulatory constraints — I'd build it as an event-driven materialized view with transactional revalidation."

10

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Broker, Basic Orders (10K users)

Monolith. PostgreSQL for everything. REST polling for prices. Market orders only. Single broker connection. Works perfectly for a small user base.

2

Real-Time & Scale (100K–1M users)

Split into Order Service + Market Data Service. Add WebSocket tier for live prices. Redis for price caching and pub/sub. Kafka between order acceptance and execution. Limit orders.

3

Multi-Broker & Reliability (1M–10M users)

Multiple broker/exchange connections for best execution routing (Reg NMS). PostgreSQL read replicas. Horizontal WebSocket tier. End-of-day reconciliation. Options trading.

4

Planet-Scale (10M+ users)

Account DB sharded by user_id. Dedicated risk engine for margin/options. Real-time fraud detection (Flink). Multi-region with order routing to nearest exchange gateway. Crypto (24/7 markets). Tax-lot tracking, wash sale detection.

📺

References & Videos

Design a Stock Trading Platform

Jordan Has No Life · 25 min

Design a Stock Exchange

AlgoMaster