Stock Trading Platform

Design a Robinhood-scale platform that handles money and stock ownership with zero tolerance for inconsistency, while simultaneously streaming real-time price data to millions of users.

Financial Systems · Real-Time Streaming · Strong Consistency · Event Sourcing · WebSocket · FIX Protocol
01

Problem Statement

Design a stock trading platform that allows users to buy and sell stocks, view real-time market data, and manage their portfolio. The system must handle 25M registered users with 5M daily active users and 2M concurrent WebSocket connections during market hours.

This system has two fundamentally different halves: a read-heavy, latency-sensitive market data pipeline and a write-heavy, correctness-critical order execution path. The architecture must treat these as separate planes with different consistency guarantees.

Core question: How do you build a system where the trading path demands CP (consistency over availability) while the market data path demands AP (availability over consistency) — and both share a user identity?

02

Requirements

Functional Requirements

Core Trading

  • Place orders — market, limit, stop-loss, stop-limit
  • Cancel/modify pending (unfilled) orders
  • Support for fractional shares (buy $5 of AAPL instead of 1 full share)
  • Extended hours trading (pre-market 4am–9:30am, after-hours 4pm–8pm ET)

Portfolio & Account

  • View portfolio — holdings, real-time P&L, cost basis
  • Portfolio equity chart over time (the iconic line graph)
  • Deposit/withdraw funds (ACH bank transfer, 1-3 day settlement)
  • Buying power calculation — settled cash + unsettled funds (T+1) + instant deposit credit + margin − pending reserves − regulatory holds

Market Data

  • Real-time price tickers (bid/ask/last/volume)
  • Historical candlestick data (1min, 5min, 1hr, 1day candles)
  • Order book depth (Level 2 — top 5 bids/asks)

Compliance & Regulatory

  • Pattern Day Trader (PDT) detection — FINRA rule: 4+ day trades in 5 business days with account < $25K restricts trading
  • Tax lot tracking for 1099 generation
  • KYC/AML — identity verification at signup
  • Immutable audit log — SEC Rule 17a-4 requires at least 6 years of retention; this design keeps 7

Non-Functional Requirements

Consistency & Correctness

  • Exactly-once order execution — no duplicate fills, ever
  • Atomic balance updates — cash debit and share credit together or not at all
  • Read-after-write guarantee on order path
  • T+1 settlement tracking (since May 2024)

Availability & Latency

  • Order acknowledgment < 200ms
  • Market data latency < 500ms exchange-to-screen
  • 99.99% uptime during market hours (~10 min downtime/year max across ~1,638 market hours)
  • Graceful degradation — trading works if market data is down, and vice versa
03

Scale Estimation

All numbers derived from 25M registered users, 5M DAU, 20% active traders. Market hours are 6.5 hrs/day, 252 trading days/year. Traffic is extremely spiky — 25% of daily volume in the first 30 minutes.

Metric                    Value
Peak Orders/sec           3K/sec
WebSocket Connections     2M
Market Data Ticks In      500K/sec
Order Storage             1.1 TB/yr
Tick Data Raw             1.1 TB/day
Portfolio Reads Peak      3K/sec
WebSocket Servers         40+
Active Symbols            ~10K

Derivation

5M DAU: 20% active traders (1M) × 4 orders/day = 4M orders/day. Average: 170/sec across 6.5 hrs. Market-open spike (25% of volume in 30 min) = 555/sec. Burst within open (3-5×) = 1,700-2,800/sec. Design target: 3K/sec with headroom. GameStop-style events can hit 10K/sec.

2M concurrent users × 80% with app open = 1.6M WebSocket connections, rounded to 2M. At 50K connections/server = 40 servers. Market data fan-out after batching: ~50K msg/sec through Redis, fan-out to connections at WebSocket server level.

Key insight: Market data fan-out, not order flow, is the actual scalability bottleneck. 2M users × 20 symbols × 5 updates/sec = 200M messages/sec outbound before batching. This drives the WebSocket tier sizing.
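The derivation above can be checked with a few lines of arithmetic. This is a sketch that only restates the figures already given in this section (5M DAU, 20% active traders, 4 orders per trader, 25% of volume in the first 30 minutes, 50K connections per server); the variable names are illustrative.

```python
# Back-of-envelope check of the order-rate and fan-out figures above.
DAU = 5_000_000
ORDERS_PER_TRADER = 4
MARKET_SECONDS = 6.5 * 3600                 # 23,400 seconds per session

active_traders = DAU * 20 // 100            # 20% of DAU place orders
orders_per_day = active_traders * ORDERS_PER_TRADER
avg_orders_per_sec = orders_per_day / MARKET_SECONDS

# 25% of the day's orders arrive in the first 30 minutes.
open_orders_per_sec = (orders_per_day // 4) / (30 * 60)
burst_low = 3 * open_orders_per_sec
burst_high = 5 * open_orders_per_sec

# WebSocket tier: server count and raw (unbatched) outbound fan-out.
CONNECTIONS = 2_000_000
CONNS_PER_SERVER = 50_000
servers = CONNECTIONS // CONNS_PER_SERVER
raw_fanout_per_sec = CONNECTIONS * 20 * 5   # 20 symbols, 5 updates/sec each

print(f"orders/day: {orders_per_day:,}")
print(f"average: {avg_orders_per_sec:.1f}/sec, open: {open_orders_per_sec:.1f}/sec")
print(f"burst: {burst_low:.0f}-{burst_high:.0f}/sec")
print(f"servers: {servers}, raw fan-out: {raw_fanout_per_sec:,}/sec")
```

The burst range lands below the 3K/sec design target, which is where the headroom in that target comes from.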

04

API Design

Four API domains with different consistency requirements. Every mutating endpoint requires X-Idempotency-Key header. All prices as strings (not floats) to avoid rounding errors.
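A short sketch of why prices travel as strings: parsing them into Python's `decimal.Decimal` keeps the arithmetic exact, whereas binary floats do not. The helper name `reserve_for_limit_buy` and the half-up rounding to cents are assumptions for illustration, not part of the API described here.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def reserve_for_limit_buy(quantity: str, limit_price: str) -> str:
    """Return the cash to reserve for a limit buy, as a two-decimal string."""
    amount = Decimal(quantity) * Decimal(limit_price)
    # Rounding mode is an assumption; a real broker would follow its own policy.
    return str(amount.quantize(CENT, rounding=ROUND_HALF_UP))

# Binary floats cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(reserve_for_limit_buy("10", "185.50"))
print(float_sum == 0.3, decimal_sum == Decimal("0.3"))
```

With the Place Order example values (10 shares at a limit of "185.50"), the helper reproduces the `reserved_amount` shown in the 201 response.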

Trading API

Place Order
POST /v1/orders
Headers: Authorization: Bearer <token>, X-Idempotency-Key: <uuid>

{
  "symbol": "AAPL",
  "side": "buy",                    // buy | sell
  "type": "limit",                  // market | limit | stop | stop_limit
  "quantity": 10,                   // OR "notional": "500.00" (fractional)
  "limit_price": "185.50",          // string, not float
  "time_in_force": "day",           // day | gtc | ioc | fok
  "extended_hours": false
}

→ 201 { order_id, status: "pending_new", reserved_amount: "1855.00", etag: "v1" }

Errors: 422 INSUFFICIENT_BUYING_POWER | 422 SYMBOL_HALTED
        403 ACCOUNT_RESTRICTED (PDT) | 409 DUPLICATE_ORDER
Cancel & Modify
DELETE /v1/orders/{order_id}
→ 200 { status: "pending_cancel" }   // NOT "cancelled" — async confirmation

PATCH /v1/orders/{order_id}
Headers: If-Match: "v1"              // optimistic lock via ETag
{ "limit_price": "184.00" }
→ 200 { status: "pending_replace", etag: "v2" }

Market Data — WebSocket

Real-time Streaming (single connection, multiple channels)
WebSocket /v1/market/stream

Client → Server: { "action": "subscribe", "channel": "quotes", "symbols": ["AAPL","TSLA"] }
Server → Client: { "channel": "quotes", "symbol": "AAPL", "bid": "187.42",
                    "ask": "187.44", "last": "187.43", "volume": 48293017 }
Server → Client: { "channel": "orders", "order_id": "ord_abc", "status": "filled",
                    "avg_fill_price": "185.32" }
Server → Client: { "channel": "heartbeat", "ts": "..." }   // every 30s

Portfolio API

Portfolio Snapshot
GET /v1/portfolio → 200
{
  "equity": "47832.50",
  "cash": { "settled": "10800.50", "unsettled": "1500.00", "withdrawable": "10800.50" },
  "buying_power": "24600.00",
  "day_trades_remaining": 3,
  "positions": [
    { "symbol": "AAPL", "quantity": "10.000000", "avg_cost": "185.32",
      "current_price": "187.45", "unrealized_pnl": "21.30" }
  ],
  "as_of": "2025-06-10T14:30:05Z"  // snapshot timestamp
}

Design principle: POST /orders returns acknowledgment, not fill confirmation. Fills arrive asynchronously via WebSocket — because the order goes through validation, risk checks, and exchange routing before execution.

05

High-Level Architecture

Two-plane architecture: Trading Plane (CP) optimized for correctness and exactly-once execution, and Market Data Plane (AP) optimized for throughput and low latency. They connect at the WebSocket Gateway and Portfolio Service.

[Architecture diagram: Trading Plane (CP) on top, Market Data Plane (AP) on the bottom]

Trading Plane (CP): Client (Mobile / Web) → [HTTPS] → API Gateway (Auth · Rate Limit) → [REST] → Order Service (Validate · Reserve) → [Publish] → Kafka (orders.*) → [Consume] → Execution Engine (SOR · FIX) → [FIX] → Exchanges (NYSE · NASDAQ) → [Fills] → Fill Processor (Atomic Update) → [Update] → PostgreSQL (Orders · Accounts) → Portfolio Service (Read Replicas)

Market Data Plane (AP): Exchange Feeds (CTA · UTP) → [Binary Feed] → MD Ingestion (NBBO · Normalize) → [Publish] → Redis Pub/Sub (Per-Symbol Channels) → [Subscribe] → WebSocket Gateway (40 servers · 2M conns) → [Push] → Client; MD Ingestion → [Store] → TimescaleDB (Tick History)

Infrastructure Sizing

Component            Instances   Specs
API Gateway          4-6         4 CPU, 8GB RAM
Order Service        6-10        4 CPU, 16GB RAM
Execution Engine     4-8         8 CPU, 16GB RAM
Fill Processor       4-6         4 CPU, 8GB RAM
Portfolio Service    8-12        4 CPU, 16GB RAM
WebSocket Gateway    40-60       4 CPU, 32GB RAM
PostgreSQL Primary   1           32 CPU, 128GB RAM, NVMe
Redis Cluster        6 nodes     16GB RAM each
Kafka Cluster        5 brokers   8 CPU, 32GB RAM, SSD

Request Flow — Step Through
Client → API Gateway → Order Service → PostgreSQL → Kafka → Execution Engine → Exchange (FIX) → Fill Processor → Portfolio DB → WebSocket Push
06

Deep Dives — 10 Rabbit Holes

Each rabbit hole explores a critical subsystem at implementation depth.

01

Order Lifecycle & Exactly-Once Execution

Every order is a finite state machine with defined transitions. Any transition not in the map is rejected. State is derived from an append-only event log, not mutable status columns.

stateDiagram-v2
    [*] --> CREATED
    CREATED --> PENDING_NEW: Publish to Kafka
    PENDING_NEW --> SUBMITTED: FIX NewOrderSingle
    SUBMITTED --> PARTIALLY_FILLED: Exchange partial fill
    SUBMITTED --> FILLED: Exchange full fill
    SUBMITTED --> REJECTED: Exchange rejects
    SUBMITTED --> PENDING_CANCEL: User cancels
    SUBMITTED --> EXPIRED: Market closes (TIF=day)
    PARTIALLY_FILLED --> PARTIALLY_FILLED: More fills
    PARTIALLY_FILLED --> FILLED: Final fill
    PARTIALLY_FILLED --> PENDING_CANCEL: User cancels remainder
    PENDING_CANCEL --> CANCELLED: Cancel confirmed
    PENDING_CANCEL --> FILLED: Filled before cancel arrived
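The transition map in the diagram above can be expressed directly as data. The sketch below is one possible encoding, not the platform's actual implementation; `ALLOWED`, `apply_transition`, and `status_from_events` are hypothetical names. It rejects any transition not in the map and derives the current status by replaying an event list.

```python
from typing import Dict, Iterable, Set

# Allowed transitions, copied from the state diagram.
ALLOWED: Dict[str, Set[str]] = {
    "CREATED": {"PENDING_NEW"},
    "PENDING_NEW": {"SUBMITTED"},
    "SUBMITTED": {"PARTIALLY_FILLED", "FILLED", "REJECTED",
                  "PENDING_CANCEL", "EXPIRED"},
    "PARTIALLY_FILLED": {"PARTIALLY_FILLED", "FILLED", "PENDING_CANCEL"},
    "PENDING_CANCEL": {"CANCELLED", "FILLED"},
}

class IllegalTransition(Exception):
    """Raised when an event asks for a transition that is not in the map."""

def apply_transition(current: str, target: str) -> str:
    # Terminal states (FILLED, REJECTED, ...) have no entry, so nothing is allowed.
    if target not in ALLOWED.get(current, set()):
        raise IllegalTransition(f"{current} -> {target}")
    return target

def status_from_events(targets: Iterable[str]) -> str:
    """Replay an order's events from CREATED and return the derived status."""
    state = "CREATED"
    for target in targets:
        state = apply_transition(state, target)
    return state

print(status_from_events(["PENDING_NEW", "SUBMITTED", "PARTIALLY_FILLED", "FILLED"]))
```

Keeping the map as plain data makes the cancel-versus-fill race explicit: `PENDING_CANCEL` can legally resolve to either `CANCELLED` or `FILLED`.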

Event Sourcing — How State Is Stored

We never treat UPDATE orders SET status = 'FILLED' as the record of what happened. Instead, every state change is an immutable event appended to the order_events table, and current status is derived from the latest event. The append-only log doubles as the audit trail that FINRA expects.

INSERT INTO order_events (order_id, event_type, fill_qty, fill_price, venue, ts)
VALUES ('ord_abc', 'PARTIAL_FILL', 4, 187.30, 'XNYS', NOW());

-- Performance optimization: materialized status column
-- Updated in same transaction as event insert
BEGIN;
  INSERT INTO order_events (...) VALUES (...);
  UPDATE orders SET status = 'FILLED', filled_qty = 10,
    avg_fill_price = 187.42, updated_at = NOW()
  WHERE id = 'ord_abc';
COMMIT;
-- Event log is source of truth; status column is read optimization

The Exactly-Once Problem — Three Boundaries

Boundary 1: Client → Gateway

Client retries POST /orders due to timeout. Gateway checks idempotency key in Redis — if exists, returns cached response. PostgreSQL UNIQUE constraint on orders.idempotency_key is the definitive backstop.
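A minimal sketch of the database backstop for Boundary 1, using SQLite in memory as a stand-in for PostgreSQL and omitting the Redis cache layer. The table layout and the `place_order` helper are assumptions for illustration; the point is only that a retry with the same key returns the original order instead of creating a second one.

```python
import sqlite3
import uuid

# SQLite stands in for PostgreSQL here; the UNIQUE constraint is the backstop.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE orders (
           id TEXT PRIMARY KEY,
           idempotency_key TEXT NOT NULL UNIQUE,
           symbol TEXT NOT NULL,
           quantity INTEGER NOT NULL)"""
)

def place_order(idempotency_key: str, symbol: str, quantity: int) -> str:
    """Insert the order once; a retry with the same key returns the first id."""
    order_id = "ord_" + uuid.uuid4().hex[:12]
    try:
        with db:  # commits on success, rolls back if the insert raises
            db.execute(
                "INSERT INTO orders VALUES (?, ?, ?, ?)",
                (order_id, idempotency_key, symbol, quantity),
            )
        return order_id
    except sqlite3.IntegrityError:
        # Duplicate key: look up and return the order created by the first call.
        row = db.execute(
            "SELECT id FROM orders WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]

first = place_order("key-123", "AAPL", 10)
retry = place_order("key-123", "AAPL", 10)   # simulated client retry
print(first == retry)
```
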

Boundary 2: Order Service → Kafka

Order committed to DB but Kafka publish fails. Transactional Outbox pattern: write order + outbox row in same transaction. Outbox relay publishes to Kafka. Alternative: Debezium CDC reads PostgreSQL WAL and auto-publishes.

Boundary 3: Execution Engine → Exchange

FIX message sent, no response, engine retries. Exchange rejects duplicate ClOrdID. FIX sequence number recovery resends missed messages on reconnect.

Partial Fills

A limit order for 100 shares may fill as 30 + 20 + 50 across different venues. Each partial fill triggers an atomic update: insert event, update position with weighted average cost, release proportional reserve, debit actual fill cost.

-- Weighted average cost update on each fill
new_avg_cost = (old_avg_cost × old_qty + fill_price × fill_qty) / (old_qty + fill_qty)

-- Reserve release per partial fill
pro_rata_reserve = (fill_qty / order_qty) × total_reserved
released = pro_rata_reserve - actual_fill_cost
-- Net: user gets back excess reserve as buying power
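The two formulas above can be combined into a single per-fill update. This is a sketch under stated assumptions: a buy order, `Decimal` quantities to allow fractional shares, and a hypothetical function name `apply_partial_fill`.

```python
from decimal import Decimal
from typing import Tuple

def apply_partial_fill(
    old_qty: Decimal,
    old_avg_cost: Decimal,
    fill_qty: Decimal,
    fill_price: Decimal,
    order_qty: Decimal,
    total_reserved: Decimal,
) -> Tuple[Decimal, Decimal, Decimal, Decimal]:
    """Return (new_qty, new_avg_cost, actual_fill_cost, released_reserve)."""
    new_qty = old_qty + fill_qty
    # Weighted average cost across the existing position and this fill.
    new_avg_cost = (old_avg_cost * old_qty + fill_price * fill_qty) / new_qty
    actual_fill_cost = fill_price * fill_qty
    # Share of the original reserve that belongs to this fill.
    pro_rata_reserve = total_reserved * fill_qty / order_qty
    released = pro_rata_reserve - actual_fill_cost
    return new_qty, new_avg_cost, actual_fill_cost, released

# Limit buy of 100 shares at 187.50 (reserve 18,750.00); first fill is 30 @ 187.30.
qty, avg_cost, cost, released = apply_partial_fill(
    Decimal("0"), Decimal("0"),
    Decimal("30"), Decimal("187.30"),
    Decimal("100"), Decimal("18750.00"),
)
print(qty, avg_cost, cost, released)
```

In this example the fill is cheaper than the limit price, so the difference between the pro-rata reserve and the actual cost flows back into buying power.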
02

Buying Power as a Derived State Machine

Buying power looks like a simple number but is a real-time derived state that depends on seven inputs: settled cash, unsettled cash (T+1), instant deposit credit, margin allowance, pending order reserves, the margin maintenance minimum, and regulatory holds.

buying_power = settled_cash
             + unsettled_cash        // from recent sells, T+1 settlement
             + instant_deposit_credit // ACH pending, capped by tier
             + margin_allowance       // if margin account, 2x leverage
             - reserved_for_pending   // buy orders lock funds
             - maintenance_minimum    // margin maintenance requirement
             - regulatory_holds       // PDT restrictions, GFV limits
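The formula above translates almost line for line into code. The sketch below assumes all inputs are already available as `Decimal` values on a snapshot object; the `AccountSnapshot` class and the sample numbers are illustrative, not taken from the platform.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class AccountSnapshot:
    settled_cash: Decimal
    unsettled_cash: Decimal           # from recent sells, settles T+1
    instant_deposit_credit: Decimal   # ACH pending, capped by tier
    margin_allowance: Decimal         # zero for cash accounts
    reserved_for_pending: Decimal     # funds locked by open buy orders
    maintenance_minimum: Decimal      # margin maintenance requirement
    regulatory_holds: Decimal         # PDT restrictions, GFV limits

def buying_power(a: AccountSnapshot) -> Decimal:
    """Apply the buying-power formula term by term."""
    return (
        a.settled_cash
        + a.unsettled_cash
        + a.instant_deposit_credit
        + a.margin_allowance
        - a.reserved_for_pending
        - a.maintenance_minimum
        - a.regulatory_holds
    )

snapshot = AccountSnapshot(
    settled_cash=Decimal("1000.00"),
    unsettled_cash=Decimal("500.00"),
    instant_deposit_credit=Decimal("250.00"),
    margin_allowance=Decimal("0"),
    reserved_for_pending=Decimal("300.00"),
    maintenance_minimum=Decimal("0"),
    regulatory_holds=Decimal("50.00"),
)
print(buying_power(snapshot))
```

Nothing in the formula prevents a negative result, which is exactly the state an account reaches when an instant deposit credit is reversed after the money has been spent.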

T+1 Settlement

Sell AAPL on Monday → cash is "unsettled" until Tuesday. Unsettled cash CAN buy stocks but CANNOT be withdrawn. Buying with unsettled funds then selling before settlement = Good Faith Violation. Three GFVs in 12 months restricts account to settled-cash trading for 90 days.

The ACH Bounce Nightmare

Worst-case edge flow

Day 1: User deposits $5,000 via ACH, gets instant credit, buys $4,800 of TSLA. Day 3: ACH bounces. System must reverse $5,000 credit → user has negative buying power. Account frozen, user has 5 days to deposit or positions are force-liquidated.

Event-Driven Materialized View

Instead of recomputing from scratch, maintain a running snapshot updated incrementally on every relevant event (order created, filled, cancelled, ACH status change, settlement, price tick for margin). Kafka Streams consumer with RocksDB state store writes to Redis for the fast-path check. Order Service does pessimistic revalidation in PostgreSQL inside the transaction.

// Fast path (optimistic, no DB lock)
bp = redis.get("bp:" + userId)
if (bp < orderAmount) → reject immediately (no DB hit needed)

// Slow path (inside transaction, authoritative)
BEGIN;
  SELECT buying_power FROM accounts WHERE id = $1 FOR UPDATE;
  -- revalidate against actual data, then proceed
COMMIT;
03

Market Data Fan-Out at Scale

The actual scalability bottleneck. 100K-500K ticks/sec inbound from exchanges, fan out to 2M concurrent subscribers with different watchlists.

Inbound: Exchange Feed Processing

CTA (NYSE-listed) and UTP (NASDAQ-listed) feeds deliver binary market data. The ingestion service parses at < 5μs per message, maintains an in-memory NBBO book per symbol, and only publishes downstream when the NBBO actually changes — cutting downstream load by ~80%.

// NBBO change filter — critical optimization
On each quote update from exchange:
  Update bid/ask from reporting exchange
  Recompute best_bid = max(all_exchange_bids)
  Recompute best_ask = min(all_exchange_asks)
  If NBBO changed → PUBLISH market:{symbol} to Redis
  If NBBO unchanged → skip (saves 80% of publishes)
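The filter above can be sketched as a small in-memory book. This version assumes one bid/ask pair per venue per symbol and recomputes the best prices with a linear scan, which is fine for the dozen or so venues per symbol; `NbboBook` and `on_quote` are hypothetical names, and a production ingester would be written for much lower latency.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

Nbbo = Tuple[Decimal, Decimal]  # (best_bid, best_ask)

class NbboBook:
    """Tracks per-venue quotes and reports only real NBBO changes."""

    def __init__(self) -> None:
        self._quotes: Dict[str, Dict[str, Nbbo]] = {}  # symbol -> venue -> quote
        self._nbbo: Dict[str, Nbbo] = {}               # symbol -> last published

    def on_quote(self, symbol: str, venue: str,
                 bid: Decimal, ask: Decimal) -> Optional[Nbbo]:
        """Return the new NBBO if it changed, else None (skip the publish)."""
        venues = self._quotes.setdefault(symbol, {})
        venues[venue] = (bid, ask)
        best_bid = max(b for b, _ in venues.values())
        best_ask = min(a for _, a in venues.values())
        nbbo = (best_bid, best_ask)
        if self._nbbo.get(symbol) == nbbo:
            return None
        self._nbbo[symbol] = nbbo
        return nbbo

book = NbboBook()
r1 = book.on_quote("AAPL", "XNYS", Decimal("187.42"), Decimal("187.44"))
r2 = book.on_quote("AAPL", "XNAS", Decimal("187.41"), Decimal("187.45"))  # worse on both sides
r3 = book.on_quote("AAPL", "XNAS", Decimal("187.43"), Decimal("187.45"))  # improves the bid
print(r1, r2, r3)
```

Only the first and third updates would be published to Redis; the second leaves the NBBO untouched and is dropped.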

Fan-Out: Redis Pub/Sub → WebSocket Servers

Each WebSocket server subscribes to Redis channels only for symbols its connected users watch. When AAPL ticks, Redis sends one message to each subscribing server, which then multicasts to all local connections watching AAPL.

✓ Chosen

Redis Pub/Sub

Fire-and-forget. <1ms latency. If subscriber is slow, messages drop — acceptable for ticks. Latest value always wins.

✗ Alternative

Kafka for Market Data

Persistent, guaranteed delivery. But 5-20ms latency, and stale ticks queuing up is worse than dropping them.

WebSocket Server Internal Fan-Out

Hot path: Redis delivers one AAPL message → server looks up HashMap<Symbol, Set<ConnectionId>> → maybe 15,000 connections need it → serialize once → write binary frame to 15,000 TCP buffers. Server-side batching at 50ms windows deduplicates multiple ticks for the same symbol, and writev() gather writes (scatter/gather I/O) reduce the number of syscalls.
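The batching step can be illustrated with a conflating buffer: within a window only the latest payload per symbol is kept, and a flush produces one write per subscribed connection. This is a sketch with hypothetical names (`FanOutBatcher`, `flush`); in a real server a 50 ms timer would call `flush()` and the writes would go to sockets rather than a list.

```python
from typing import Dict, List, Set, Tuple

class FanOutBatcher:
    """Keeps the latest tick per symbol and fans it out on flush."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, Set[int]] = {}  # symbol -> connection ids
        self._latest: Dict[str, bytes] = {}         # symbol -> newest payload

    def subscribe(self, conn_id: int, symbol: str) -> None:
        self.subscribers.setdefault(symbol, set()).add(conn_id)

    def on_tick(self, symbol: str, payload: bytes) -> None:
        # Latest value wins: older ticks in the same window are overwritten.
        self._latest[symbol] = payload

    def flush(self) -> List[Tuple[int, bytes]]:
        """Return (connection, payload) writes for this window and reset it."""
        writes: List[Tuple[int, bytes]] = []
        for symbol, payload in self._latest.items():
            for conn_id in sorted(self.subscribers.get(symbol, ())):
                writes.append((conn_id, payload))  # payload serialized once
        self._latest.clear()
        return writes

batcher = FanOutBatcher()
batcher.subscribe(1, "AAPL")
batcher.subscribe(2, "AAPL")
batcher.subscribe(2, "TSLA")
batcher.on_tick("AAPL", b"a1")
batcher.on_tick("AAPL", b"a2")   # replaces a1 inside the same window
batcher.on_tick("TSLA", b"t1")
window = batcher.flush()
print(window)
```
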

Per-Server Numbers

50K connections, top 10 symbols at 15K subscribers each × 5 ticks/sec = 750K writes/sec. Mid-tier 1000 symbols at 500 subscribers × 2 ticks/sec = 1M writes/sec. Total: ~1.75M write operations/sec per server, ~175 MB/sec outbound at ~100 bytes per message (≈1.4 Gbps). A 1 Gbps NIC cannot carry this; provision at least 2 Gbps with headroom.

04

Smart Order Routing & Reg NMS

When a user submits "BUY 100 AAPL MARKET," the order could go to any of more than a dozen US equity exchanges, multiple market makers (Citadel, Virtu), or dark pools. The Smart Order Router (SOR) selects the best venue.

Reg NMS — The Legal Constraint

Rule 611 (Order Protection Rule): you cannot execute a trade at a price inferior to the National Best Bid/Offer (NBBO). If NYSE offers AAPL at $187.42 and you route to Citadel who fills at $187.45, you violated Reg NMS. The SOR must compare all venue quotes against the NBBO before routing.

SOR Decision Engine

Scoring per venue (weighted):
  Price improvement:     40% — filling better than the NBBO (below the ask on a buy) saves the customer money
  Historical fill rate:  25% — what % of orders at this venue actually fill?
  Speed:                 15% — how fast does the venue respond?
  PFOF rate:             10% — payment for order flow revenue
  Stability:             10% — reject rate penalty
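One way to combine the weights above with the Reg NMS constraint is to filter first and score second: venues quoting worse than the national best ask are never eligible for a buy, regardless of their score. The sketch assumes each factor has already been normalized to the range 0 to 1; the `Venue` fields, `score`, and `route_buy` are illustrative names.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional

# Weights from the scoring table above.
WEIGHTS = {
    "price_improvement": 0.40,
    "fill_rate": 0.25,
    "speed": 0.15,
    "pfof": 0.10,
    "stability": 0.10,
}

@dataclass(frozen=True)
class Venue:
    name: str
    ask: Decimal               # price this venue offers for a buy
    price_improvement: float   # all factors assumed normalized to [0, 1]
    fill_rate: float
    speed: float
    pfof: float
    stability: float

def score(v: Venue) -> float:
    return sum(weight * getattr(v, factor) for factor, weight in WEIGHTS.items())

def route_buy(venues: List[Venue], national_best_ask: Decimal) -> Optional[Venue]:
    """Pick the best-scoring venue that does not trade through the NBBO."""
    eligible = [v for v in venues if v.ask <= national_best_ask]
    return max(eligible, key=score, default=None)

venues = [
    Venue("A", Decimal("187.45"), 1.0, 1.0, 1.0, 1.0, 1.0),  # best score, worse price
    Venue("B", Decimal("187.42"), 0.5, 0.5, 0.5, 0.5, 0.5),
    Venue("C", Decimal("187.41"), 1.0, 0.5, 0.5, 0.5, 0.5),
]
choice = route_buy(venues, national_best_ask=Decimal("187.42"))
print(choice.name if choice else None)
```

Treating the NBBO check as a hard filter rather than a heavily weighted factor keeps a high PFOF rate from ever outvoting the price constraint.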

PFOF Economics — How Robinhood Makes Money

Revenue model

Market makers pay $0.002-$0.004/share for retail order flow. 4M orders/day × 20 shares avg × $0.003 = $240K/day from equities. Options PFOF at $0.40-$0.60/contract is 2-3× more profitable. Total PFOF revenue: ~$150-200M/year. Market makers profit because retail flow is "uninformed" — less likely to be followed by adverse price moves.

05

Reconciliation & Settlement

The safety net that catches every edge case the real-time systems miss. Runs daily after market close.

End-of-Day Reconciliation (5:00 PM ET)

Step 1: Pull execution reports from all brokers (FIX drop copy / API)
Step 2: Pull internal order_events for the day
Step 3: Match by ClOrdID
Step 4: Compare each pair — qty, price, venue, status
Step 5: Classify discrepancies:
  Cat A — Missing internally (most critical): reconstruct from exchange data
  Cat B — Missing at exchange (suspicious): quarantine, investigate
  Cat C — Price/qty mismatch: auto-correct if < $0.01, else manual review
  Cat D — Status mismatch: update our status, release held reserves
Step 6: Report — typically 99.999% perfect match rate
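Steps 3 through 5 above amount to a keyed join followed by classification. The sketch below assumes both sides have been loaded into dictionaries keyed by ClOrdID; the `Execution` record and `reconcile` function are hypothetical, and the auto-correct threshold for Cat C is left out for brevity.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List

@dataclass(frozen=True)
class Execution:
    qty: int
    price: Decimal
    status: str

def reconcile(internal: Dict[str, Execution],
              external: Dict[str, Execution]) -> Dict[str, List[str]]:
    """Classify every ClOrdID seen on either side into a category."""
    report: Dict[str, List[str]] = {"A": [], "B": [], "C": [], "D": [], "matched": []}
    for cl_ord_id in sorted(internal.keys() | external.keys()):
        ours = internal.get(cl_ord_id)
        theirs = external.get(cl_ord_id)
        if ours is None:
            report["A"].append(cl_ord_id)        # missing internally
        elif theirs is None:
            report["B"].append(cl_ord_id)        # missing at exchange
        elif (ours.qty, ours.price) != (theirs.qty, theirs.price):
            report["C"].append(cl_ord_id)        # price/qty mismatch
        elif ours.status != theirs.status:
            report["D"].append(cl_ord_id)        # status mismatch
        else:
            report["matched"].append(cl_ord_id)
    return report

internal = {
    "o1": Execution(10, Decimal("187.42"), "FILLED"),
    "o2": Execution(5, Decimal("50.00"), "FILLED"),
    "o3": Execution(7, Decimal("12.10"), "SUBMITTED"),
    "o4": Execution(1, Decimal("9.99"), "FILLED"),
}
external = {
    "o1": Execution(10, Decimal("187.42"), "FILLED"),
    "o2": Execution(5, Decimal("50.05"), "FILLED"),
    "o3": Execution(7, Decimal("12.10"), "FILLED"),
    "o5": Execution(3, Decimal("20.00"), "FILLED"),
}
result = reconcile(internal, external)
print(result)
```
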

Corporate Actions

Stock splits, reverse splits, dividends, mergers, and spinoffs change positions overnight without user action. A corporate actions processor runs nightly: cancels pending orders for affected symbols, adjusts position quantities and cost basis, creates audit events, and notifies users. Stock splits also require retroactive adjustment of historical price data and portfolio snapshots.

06

WebSocket Connection Management at 2M Scale

Memory Per Connection

TCP socket buffers:     ~32 KB (tuned down from 128KB default)
TLS session state:      ~4 KB
WebSocket frame buffer: ~8 KB
Application state:      ~3 KB (metadata + subscriptions + ring buffer)
─────────────────────────────────
Total: ~47 KB per connection (tuned) to ~160 KB (default)
Per server (50K conns): ~2.4 GB (tuned) to ~8 GB (default) RAM for connections alone

L4 vs L7 Load Balancing

✓ Chosen

L4 (NLB / HAProxy TCP)

Forwards TCP packets directly. Single connection client→server. LB tracks only {src:port → dst:port} mapping (~100 bytes each). 2M × 100 bytes = 200 MB — trivial.

✗ Alternative

L7 (ALB / Nginx)

Terminates WebSocket, creates new connection to backend. Doubles connection count (LB holds 2M + servers hold 2M). 320 GB RAM on LB alone. LB becomes throughput bottleneck.

Reconnection Storm Mitigation

When WS-17 crashes, 50K clients reconnect simultaneously. Without mitigation: thundering herd. Solution: client jitter (random 0-10s delay), server-side rate limiting (max 500 new connections/sec), LB health checks (overloaded servers marked unhealthy). Graceful deploys send {"action":"reconnect","delay_ms":random(0,30000)} before draining.
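Client-side jitter can be sketched as follows. The first attempt matches the random 0-10s delay described above; widening the window exponentially on later attempts (capped at 60s) is an added assumption, as are the function name and default parameters.

```python
import random

def reconnect_delay(attempt: int, rng: random.Random,
                    base_s: float = 10.0, cap_s: float = 60.0) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The delay is drawn uniformly from [0, ceiling], where the ceiling starts
    at base_s and doubles per failed attempt up to cap_s. The doubling and
    the cap are assumptions layered on top of the 0-10s jitter.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

rng = random.Random()
for attempt in range(4):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt, rng):.2f}s")
```

Spreading 50K clients uniformly over 10 seconds yields roughly 5K connection attempts per second, which is why the server-side cap of 500 new connections/sec is still needed alongside the jitter.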

Order Event Routing to Correct Server

Connection registry pattern: on connect, HSET ws:connections {user_id} {server_id} in Redis. Order event consumer does one Redis lookup per fill to route to the correct WebSocket server via internal channel.

07

Risk Engine & Circuit Breakers

Pre-Trade Risk Pipeline (7 checks, < 20ms)

#   Check                                              Failure
1   Buying power sufficient                            INSUFFICIENT_BUYING_POWER
2   Position limits ($500K max per symbol)             POSITION_LIMIT_EXCEEDED
3   Order rate limits (60/min, $1M/min notional)       RATE_LIMIT_EXCEEDED
4   Price reasonability (limit within 10% of market)   PRICE_UNREASONABLE
5   Symbol tradability (not halted, session valid)     SYMBOL_HALTED
6   PDT rule (day trades remaining > 0)                PDT_RESTRICTION
7   Good Faith Violation prevention                    GFV_WARNING

Real-Time Margin Monitoring

Continuously compute margin_ratio = equity / maintenance_requirement for every margin account. Thresholds: >2.0 healthy; 1.5-2.0 warning (push notification); 1.0-1.5 margin call (restrict buying, 2-5 day grace); 0.75-1.0 forced liquidation (auto-sell most liquid positions first); <0.75 emergency liquidation (sell everything). Watch-list optimization: only accounts near a threshold get per-tick recalculation; healthy accounts are batch-checked every 30 seconds.
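The threshold ladder can be written as a single classification function. How exact boundary values (a ratio of precisely 2.0, 1.5, 1.0, or 0.75) are assigned is an assumption here, as is treating an account with no maintenance requirement as healthy; `margin_action` is a hypothetical name.

```python
from decimal import Decimal

def margin_action(equity: Decimal, maintenance_requirement: Decimal) -> str:
    """Map an account's margin ratio to the action tier described above."""
    if maintenance_requirement <= 0:
        return "healthy"  # assumption: no margin in use, nothing to monitor
    ratio = equity / maintenance_requirement
    if ratio > Decimal("2.0"):
        return "healthy"
    if ratio >= Decimal("1.5"):
        return "warning"                 # push notification
    if ratio >= Decimal("1.0"):
        return "margin_call"             # restrict buying, grace period
    if ratio >= Decimal("0.75"):
        return "forced_liquidation"      # sell most liquid positions first
    return "emergency_liquidation"       # sell everything

for equity in ("30000", "17000", "12000", "9000", "7000"):
    print(equity, margin_action(Decimal(equity), Decimal("10000")))
```
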

Exchange Circuit Breakers (LULD)

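Single-stock LULD bands: a stock that moves more than 5% in 5 min triggers a 5-min pause (5% is the band for large-cap names; wider bands apply to other tiers). Market-wide circuit breakers on S&P 500 declines: -7% = 15-min halt, -13% = 15-min halt, -20% = market closes for the day. The system must detect halt messages, stop accepting orders for halted symbols, and handle the flood when trading resumes.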

08

Portfolio Equity Curve Generation

The portfolio value at any point = cash_at_t + Σ(shares_held_at_t × price_at_t). For 25M users across years of history, this is a massive data problem.

Intraday (1-Day View)

Snapshot + delta + live prices. Start-of-day snapshot (precomputed at market open) + position changes from trades today + current prices = equity curve. 78 data points (5-min intervals across 6.5 hrs). Computed on-demand per request in ~5ms — no need to precompute for all users.
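The snapshot-plus-delta computation can be sketched as a replay over the day's price bars. This assumes trades are tagged with the index of the 5-minute bar in which they executed and that every held symbol has a price in every bar; the function name and tuple layout are illustrative.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# (bar_index, symbol, signed_qty, price): positive qty is a buy, negative a sell.
Trade = Tuple[int, str, Decimal, Decimal]

def intraday_curve(cash: Decimal,
                   positions: Dict[str, Decimal],
                   trades: List[Trade],
                   bars: List[Dict[str, Decimal]]) -> List[Decimal]:
    """Equity per bar = cash + sum(qty * price), applying trades as they occur."""
    positions = dict(positions)              # do not mutate the snapshot
    pending = sorted(trades, key=lambda t: t[0])
    curve: List[Decimal] = []
    i = 0
    for bar_index, prices in enumerate(bars):
        # Apply every trade that executed at or before this bar.
        while i < len(pending) and pending[i][0] <= bar_index:
            _, symbol, signed_qty, price = pending[i]
            positions[symbol] = positions.get(symbol, Decimal("0")) + signed_qty
            cash -= signed_qty * price
            i += 1
        curve.append(cash + sum(q * prices[s] for s, q in positions.items()))
    return curve

curve = intraday_curve(
    cash=Decimal("1000"),
    positions={"AAPL": Decimal("10")},
    trades=[(1, "AAPL", Decimal("5"), Decimal("101"))],
    bars=[{"AAPL": Decimal("100")}, {"AAPL": Decimal("101")}, {"AAPL": Decimal("102")}],
)
print(curve)
```

A trade executed at the bar's own price leaves equity unchanged at that instant, since cash falls by exactly what the position gains; the curve moves only with subsequent price changes.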

Historical (1-Week to All-Time)

Nightly batch job at 1:00 AM: for each user with positions, compute close-of-day equity = cash + Σ(qty × closing_price). Store in portfolio_snapshots table (user_id, date, equity). 25M users × 252 days × 100 bytes = 630 GB/year. Partition by date, archive >2 years to Parquet on S3.

Corporate Action Adjustments

Stock splits make historical snapshots wrong. When AAPL splits 4:1, historical price data is retroactively adjusted (divided by 4). Affected portfolio snapshots must be recomputed using split-adjusted prices — a batch job triggered by the corporate actions processor.

09

Idempotency & Failure Recovery Patterns

Six-Layer Idempotency Map

Layer                 Mechanism                             Protects Against
1. Mobile Client      UUID idempotency key, local storage   User taps "buy" twice
2. API Gateway        Redis SET NX + cached response        Network retry duplicates
3. Order Service      PostgreSQL UNIQUE constraint          Redis miss, concurrent requests
4. Kafka Producer     enable.idempotence=true, PID+seq      Producer retry sends same message twice
5. Execution Engine   RocksDB processed_order_ids set       Consumer restart reprocesses message
6. Exchange (FIX)     Unique ClOrdID per session/day        FIX session retry sends duplicate order

Transactional Outbox Pattern

BEGIN;
  INSERT INTO orders (...) VALUES (...);
  INSERT INTO order_events (...) VALUES (...);
  INSERT INTO outbox (topic, key, payload) VALUES ('orders.submitted', user_id, order_json);
COMMIT;
-- Outbox relay polls, publishes to Kafka, marks published=true
-- Alternative: Debezium CDC reads WAL, auto-publishes (no outbox table)
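A runnable sketch of the outbox write and the polling relay, with SQLite in memory standing in for PostgreSQL and a plain callback standing in for the Kafka producer. The simplified schema, `accept_order`, and `relay_once` are assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable, Dict, List, Tuple

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders (id TEXT PRIMARY KEY, user_id TEXT, symbol TEXT, quantity INTEGER);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT, key TEXT, payload TEXT,
        published INTEGER NOT NULL DEFAULT 0);
    """
)

def accept_order(order: Dict) -> None:
    """Write the order and its outbox row in one transaction."""
    with db:  # both inserts commit together or not at all
        db.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            (order["id"], order["user_id"], order["symbol"], order["quantity"]),
        )
        db.execute(
            "INSERT INTO outbox (topic, key, payload) VALUES (?, ?, ?)",
            ("orders.submitted", order["user_id"], json.dumps(order)),
        )

def relay_once(publish: Callable[[str, str, str], None]) -> int:
    """Publish unpublished rows in order; return how many were sent."""
    rows = db.execute(
        "SELECT seq, topic, key, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, key, payload in rows:
        publish(topic, key, payload)  # if this raises, the row stays unpublished
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)

sent: List[Tuple[str, str, str]] = []
accept_order({"id": "ord_abc", "user_id": "u1", "symbol": "AAPL", "quantity": 10})
first_pass = relay_once(lambda t, k, p: sent.append((t, k, p)))
second_pass = relay_once(lambda t, k, p: sent.append((t, k, p)))
print(first_pass, second_pass, sent[0][0])
```

A crash between the publish and the `published = 1` update would resend the message on the next poll, so this relay is at-least-once; the consumer-side dedup in layer 5 is what absorbs the duplicate.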

Poison Pill Handling

Unprocessable messages in Kafka (malformed, references deleted user) → Dead Letter Queue. Transient errors (DB timeout): don't commit offset, retry with backoff. Non-retriable errors: send to DLQ, commit offset to unblock partition, alert ops team.

Exactly-once argument: no single failure point can cause a duplicate trade, and no single failure point can lose a trade that was already accepted. Each layer's dedup mechanism is independent, so if one fails, the next catches it. Belt, suspenders, and a parachute.

10

Regulatory Audit Trail & Compliance

SEC Rule 17a-4

Every broker-dealer must preserve all order records, execution records, customer communications, and account records for minimum 6 years (first 2 easily accessible). Format: WORM (Write Once, Read Many) — cannot be altered or deleted. Must be producible within 24 hours of SEC/FINRA request.

Compliance Event Bus

Every service → Kafka audit.* topics → Consumers:
  ├─ S3 (WORM, Object Lock COMPLIANCE mode, 7-year retention, Parquet)
  ├─ Elasticsearch (2-year hot store, searchable within seconds)
  ├─ Compliance Dashboard (real-time monitoring)
  └─ Third-Party Archive (FINRA-approved vendor)

Daily volume: 4M orders × 5 events × 1KB = 20 GB/day raw → ~4 GB compressed

Real-Time Surveillance

Apache Flink streaming application consumes from audit.orders, maintains sliding windows per user per symbol, and detects:

Wash Trading

Buy then sell same stock at same price within minutes to create false volume. Flag if same symbol ±0.5% price within 5 minutes.
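The wash-trading rule above reduces to a pairwise predicate that a streaming job could evaluate inside its sliding window. The sketch below is plain Python rather than Flink code; the `Trade` record and `is_wash_pair` are hypothetical, and measuring the ±0.5% tolerance relative to the first trade's price is an assumption.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Trade:
    user_id: str
    symbol: str
    side: str        # "buy" or "sell"
    price: Decimal
    ts: float        # epoch seconds

def is_wash_pair(a: Trade, b: Trade,
                 window_s: float = 300.0,
                 tolerance: Decimal = Decimal("0.005")) -> bool:
    """Flag opposite-side trades by one user in one symbol, close in time and price."""
    if a.user_id != b.user_id or a.symbol != b.symbol or a.side == b.side:
        return False
    if abs(a.ts - b.ts) > window_s:
        return False
    return abs(a.price - b.price) <= tolerance * a.price

buy = Trade("u1", "XYZ", "buy", Decimal("10.00"), 1_000.0)
quick_sell = Trade("u1", "XYZ", "sell", Decimal("10.03"), 1_120.0)   # +0.3%, 2 min later
late_sell = Trade("u1", "XYZ", "sell", Decimal("10.03"), 2_000.0)    # outside 5 minutes
print(is_wash_pair(buy, quick_sell), is_wash_pair(buy, late_sell))
```

A flag from this predicate is a lead for a compliance analyst, not a verdict; ordinary scalping can match the same pattern.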

Spoofing / Layering

Place large limit orders to move price, cancel before fill. Flag if >1000 shares cancelled within 30 seconds, 3+ times/day.

Front-Running

Employee trades same symbol within 15 minutes before large customer order. All employee accounts tagged for monitoring.

Pump and Dump

Social media spike + buy activity by small group + subsequent sell after price increase. Requires external social media feed integration.

Required Filings

Rule 606 (quarterly): where orders were routed, PFOF received. CAT (daily): every order lifecycle event to FINRA. SAR (as needed): suspicious activity within 30 days of detection to FinCEN.

07

Key Design Decisions & Tradeoffs

1. Pessimistic Locking vs. Optimistic Concurrency

✓ Chosen

Pessimistic (FOR UPDATE)

Account row locked during order creation. Correctness guaranteed. Low contention per user (one person rarely submits concurrent orders). Simpler error handling.

✗ Alternative

Optimistic (version check)

Would work but retry logic is complex. User sees confusing errors on conflict. In financial systems, correctness beats throughput.

2. Event-Sourced Orders vs. Mutable Status

✓ Chosen

Event Sourcing + Materialized Column

Append-only events give full audit trail (FINRA compliance for free). Materialized status column for read performance. Best of both worlds.

✗ Alternative

Mutable Status Column Only

Simpler, but lose audit trail. Must build separate logging. In a regulated industry, you'll build the event log eventually anyway.

3. PostgreSQL vs. NoSQL for Orders

✓ Chosen

PostgreSQL

ACID transactions essential for "debit cash + credit shares" atomicity. 3K orders/sec well within capacity. Rich query support for reporting. Strong consistency.

✗ Alternative

DynamoDB

Higher throughput ceiling but weaker transactions. Multi-item atomic operations possible but complex. Overkill for this throughput level.

4. Transactional Outbox vs. Debezium CDC

✓ Chosen

Transactional Outbox

Explicit control over Kafka message format. Outbox table + relay is simple to reason about and debug. Decouples DB schema from Kafka topic schema.

✗ Alternative

Debezium CDC

No outbox table needed — reads WAL directly. But couples Kafka schema to DB schema. Harder to debug. Adds operational dependency (Debezium cluster).

5. WebSocket vs. SSE for Market Data

✓ Chosen

WebSocket

Bidirectional — client subscribes/unsubscribes dynamically. Multiplexed channels (quotes + orders + account on one connection). Efficient for high-frequency updates.

✗ Alternative

Server-Sent Events (SSE)

Simpler, unidirectional. But changing subscriptions requires new connection. Can't multiplex order updates on same stream without workarounds.

08

What Can Go Wrong

Double Execution

Network partition causes execution engine to submit same order twice. Mitigation: Exchange-side ClOrdID is idempotent — exchange rejects duplicates. Execution engine checks order state before submitting.

Fill Message Lost

Exchange fills order but FIX message never arrives. Mitigation: End-of-day reconciliation compares internal state against exchange reports. FIX sequence number recovery resends missed messages on reconnect.

Buying Power Race Condition

Two concurrent orders drain same balance. Mitigation: SELECT ... FOR UPDATE serializes concurrent orders for same user. Row-level lock on account.

WebSocket Tier Crash

Users lose real-time prices. Mitigation: Client auto-reconnects with jitter. REST fallback for latest prices. Servers are stateless — subscriptions re-established on reconnect.

Kafka Consumer Lag

Orders accepted but execution delayed. Mitigation: Monitor consumer lag. Auto-scale execution engine consumers. Alert if lag exceeds 5-second SLA.

ACH Bounce After Spending

User spends instant credit, then ACH bounces. Mitigation: Freeze account, 5-day grace period, force-liquidate if unresolved. Revoke instant deposit privilege for repeat offenders.

Database Failover Mid-Transaction

PostgreSQL primary fails during order write. Mitigation: Synchronous replication to standby. Transaction either commits on both or rolls back. Client retries with same idempotency key.

09

Interview Tips

💡
Lead with the two-plane split. Say early: "This system has two fundamentally different data paths — orders need CP, market data needs AP. I'll design them separately." This immediately signals architectural maturity.
Show the money math. Walk through one concrete order: "User has $5,000, places a limit buy for 10 AAPL at $187.50. Reserve $1,875.00. Fill at $187.42. Debit $1,874.20, release the $0.80 excess reserve." Makes the consistency requirement tangible.
🎯
The idempotency key is your anchor. When discussing failure modes, always come back to: "The idempotency key ensures that no matter where the system fails and retries, the user never gets a duplicate trade."
📋
Mention regulatory requirements. "FINRA requires full audit trails" and "SEC Reg NMS governs best execution" — one sentence each shows domain awareness that separates you from generic answers.
🔑
Don't design a matching engine unless asked. Robinhood routes to exchanges; it doesn't match orders. If asked to design an exchange, that's a different (harder) problem. Clarify scope early.
🏗️
Buying power complexity. One powerful sentence: "Buying power looks simple but it's a real-time derived state depending on settlement cycles, margin, pending reserves, and regulatory constraints — I'd build it as an event-driven materialized view with transactional revalidation."
10

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Broker, Basic Orders (10K users)

Monolith. PostgreSQL for everything. REST polling for prices. Market orders only. Single broker connection. Works perfectly for a small user base.

2

Real-Time & Scale (100K–1M users)

Split into Order Service + Market Data Service. Add WebSocket tier for live prices. Redis for price caching and pub/sub. Kafka between order acceptance and execution. Limit orders.

3

Multi-Broker & Reliability (1M–10M users)

Multiple broker/exchange connections for best execution routing (Reg NMS). PostgreSQL read replicas. Horizontal WebSocket tier. End-of-day reconciliation. Options trading.

4

Planet-Scale (10M+ users)

Account DB sharded by user_id. Dedicated risk engine for margin/options. Real-time fraud detection (Flink). Multi-region with order routing to nearest exchange gateway. Crypto (24/7 markets). Tax-lot tracking, wash sale detection.
