06
Deep Dives — 10 Rabbit Holes
Each rabbit hole explores a critical subsystem at implementation depth.
Every order is a finite state machine with defined transitions. Any transition not in the map is rejected. State is derived from an append-only event log, not mutable status columns.
stateDiagram-v2
[*] --> CREATED
CREATED --> PENDING_NEW: Publish to Kafka
PENDING_NEW --> SUBMITTED: FIX NewOrderSingle
SUBMITTED --> PARTIALLY_FILLED: Exchange partial fill
SUBMITTED --> FILLED: Exchange full fill
SUBMITTED --> REJECTED: Exchange rejects
SUBMITTED --> PENDING_CANCEL: User cancels
SUBMITTED --> EXPIRED: Market closes (TIF=day)
PARTIALLY_FILLED --> PARTIALLY_FILLED: More fills
PARTIALLY_FILLED --> FILLED: Final fill
PARTIALLY_FILLED --> PENDING_CANCEL: User cancels remainder
PENDING_CANCEL --> CANCELLED: Cancel confirmed
PENDING_CANCEL --> FILLED: Filled before cancel arrived
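The transition map in the diagram can be sketched as a lookup table that rejects anything not listed. This is a minimal illustration, not the production implementation: the names `OrderState`, `TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical, and terminal states are assumed to have no outgoing transitions.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "CREATED"
    PENDING_NEW = "PENDING_NEW"
    SUBMITTED = "SUBMITTED"
    PARTIALLY_FILLED = "PARTIALLY_FILLED"
    PENDING_CANCEL = "PENDING_CANCEL"
    FILLED = "FILLED"
    REJECTED = "REJECTED"
    CANCELLED = "CANCELLED"
    EXPIRED = "EXPIRED"


S = OrderState

# Allowed transitions, copied from the diagram.
# Terminal states (assumed) map to an empty set.
TRANSITIONS = {
    S.CREATED: {S.PENDING_NEW},
    S.PENDING_NEW: {S.SUBMITTED},
    S.SUBMITTED: {S.PARTIALLY_FILLED, S.FILLED, S.REJECTED,
                  S.PENDING_CANCEL, S.EXPIRED},
    S.PARTIALLY_FILLED: {S.PARTIALLY_FILLED, S.FILLED, S.PENDING_CANCEL},
    S.PENDING_CANCEL: {S.CANCELLED, S.FILLED},
    S.FILLED: set(),
    S.REJECTED: set(),
    S.CANCELLED: set(),
    S.EXPIRED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested transition is not in the map."""


def transition(current: OrderState, target: OrderState) -> OrderState:
    # Any transition not in the map is rejected.
    if target not in TRANSITIONS[current]:
        raise InvalidTransition(f"{current.name} -> {target.name}")
    return target
```

Keeping the map as data rather than scattered `if` statements makes it easy to diff against the diagram when a new state is added.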
Event Sourcing — How State Is Stored
We never treat UPDATE orders SET status = 'FILLED' as the record of what happened. Instead, every state change is an immutable event appended to the order_events table, and current status is derived from the latest event. The append-only log doubles as the audit trail that FINRA recordkeeping expects, with no separate logging path to maintain.
INSERT INTO order_events (order_id, event_type, fill_qty, fill_price, venue, ts)
VALUES ('ord_abc', 'PARTIAL_FILL', 4, 187.30, 'XNYS', NOW());
-- Performance optimization: materialized status column
-- Updated in same transaction as event insert
BEGIN;
INSERT INTO order_events (...) VALUES (...);
UPDATE orders SET status = 'FILLED', filled_qty = 10,
avg_fill_price = 187.42, updated_at = NOW()
WHERE id = 'ord_abc';
COMMIT;
-- Event log is source of truth; status column is read optimization
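To show what "derived from the event log" means in practice, here is a small sketch that folds fill events into the values the materialized columns cache. The `FillEvent` type, the `derive_order_view` function, and the second fill (6 shares at 187.50) are assumptions for illustration; only the first fill and the resulting totals appear in the SQL above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FillEvent:
    fill_qty: int
    fill_price: float


def derive_order_view(order_qty: int, fills: list) -> dict:
    """Fold one order's fill events into status, filled_qty and avg price."""
    filled_qty = sum(f.fill_qty for f in fills)
    notional = sum(f.fill_qty * f.fill_price for f in fills)
    if filled_qty == 0:
        status = "SUBMITTED"
    elif filled_qty < order_qty:
        status = "PARTIALLY_FILLED"
    else:
        status = "FILLED"
    avg_fill_price = notional / filled_qty if filled_qty else None
    return {
        "status": status,
        "filled_qty": filled_qty,
        "avg_fill_price": avg_fill_price,
    }


# Replaying the log reproduces the cached row: 4 @ 187.30 plus a
# hypothetical 6 @ 187.50 gives filled_qty = 10, avg price of about 187.42.
view = derive_order_view(10, [FillEvent(4, 187.30), FillEvent(6, 187.50)])
```

If the cached `orders` row ever disagrees with this replay, the event log wins and the row is rebuilt.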
The Exactly-Once Problem — Three Boundaries
Boundary 1: Client → Gateway
Client retries POST /orders due to timeout. Gateway checks idempotency key in Redis — if exists, returns cached response. PostgreSQL UNIQUE constraint on orders.idempotency_key is the definitive backstop.
Boundary 2: Order Service → Kafka
Order committed to DB but Kafka publish fails. Transactional Outbox pattern: write order + outbox row in same transaction. Outbox relay publishes to Kafka. Alternative: Debezium CDC reads PostgreSQL WAL and auto-publishes.
Boundary 3: Execution Engine → Exchange
FIX message sent, no response, engine retries. Exchange rejects duplicate ClOrdID. FIX sequence number recovery resends missed messages on reconnect.
Partial Fills
A limit order for 100 shares may fill as 30 + 20 + 50 across different venues. Each partial fill triggers an atomic update: insert event, update position with weighted average cost, release proportional reserve, debit actual fill cost.
-- Weighted average cost update on each fill
new_avg_cost = (old_avg_cost × old_qty + fill_price × fill_qty) / (old_qty + fill_qty)
-- Reserve release per partial fill
pro_rata_reserve = (fill_qty / order_qty) × total_reserved
released = pro_rata_reserve - actual_fill_cost
-- Net: user gets back excess reserve as buying power
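A worked example of the two formulas above, using `Decimal` so money arithmetic stays exact. The function names and the sample numbers (a 100-share order reserved at $190.00 per share, a 30-share fill at $187.30) are illustrative assumptions.

```python
from decimal import Decimal


def apply_fill_to_position(old_avg_cost, old_qty, fill_price, fill_qty):
    """Weighted average cost after adding fill_qty shares at fill_price."""
    new_qty = old_qty + fill_qty
    new_avg_cost = (old_avg_cost * old_qty + fill_price * fill_qty) / new_qty
    return new_avg_cost, new_qty


def reserve_to_release(fill_qty, order_qty, total_reserved, actual_fill_cost):
    """Excess reserve returned to buying power for one partial fill."""
    pro_rata_reserve = total_reserved * fill_qty / order_qty
    return pro_rata_reserve - actual_fill_cost


# Position: 50 shares at $180.00, then a fill of 50 at $190.00.
avg_cost, qty = apply_fill_to_position(
    Decimal("180.00"), 50, Decimal("190.00"), 50
)

# Order: 100 shares reserved at $190.00 each ($19,000.00 total).
# First partial fill: 30 shares at $187.30 ($5,619.00 actual cost).
released = reserve_to_release(
    30, 100, Decimal("19000.00"), 30 * Decimal("187.30")
)
```

Here the pro-rata reserve for 30 of 100 shares is $5,700.00, so $81.00 flows back to buying power after the fill.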
Buying power looks like a simple number, but it is real-time derived state that depends on seven inputs: settled cash, unsettled cash (T+1), instant deposit credit, margin allowance, pending order reserves, the margin maintenance minimum, and regulatory holds.
buying_power = settled_cash
+ unsettled_cash // from recent sells, T+1 settlement
+ instant_deposit_credit // ACH pending, capped by tier
+ margin_allowance // if margin account, 2x leverage
- reserved_for_pending // buy orders lock funds
- maintenance_minimum // margin maintenance requirement
- regulatory_holds // PDT restrictions, GFV limits
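The formula above translates directly into a small value object. This is a sketch under the assumption that every input has already been computed upstream. The `AccountSnapshot` name is hypothetical and the field names simply mirror the terms in the formula.

```python
from dataclasses import dataclass
from decimal import Decimal

ZERO = Decimal("0")


@dataclass(frozen=True)
class AccountSnapshot:
    settled_cash: Decimal = ZERO
    unsettled_cash: Decimal = ZERO          # from recent sells, T+1
    instant_deposit_credit: Decimal = ZERO  # ACH pending, capped by tier
    margin_allowance: Decimal = ZERO        # only for margin accounts
    reserved_for_pending: Decimal = ZERO    # open buy orders lock funds
    maintenance_minimum: Decimal = ZERO     # margin maintenance requirement
    regulatory_holds: Decimal = ZERO        # PDT restrictions, GFV limits

    def buying_power(self) -> Decimal:
        # Additive sources minus everything that is locked or held.
        return (
            self.settled_cash
            + self.unsettled_cash
            + self.instant_deposit_credit
            + self.margin_allowance
            - self.reserved_for_pending
            - self.maintenance_minimum
            - self.regulatory_holds
        )


snapshot = AccountSnapshot(
    settled_cash=Decimal("1000"),
    unsettled_cash=Decimal("500"),
    reserved_for_pending=Decimal("300"),
)
```

With $1,000 settled, $500 unsettled, and $300 locked by open orders, `snapshot.buying_power()` comes to $1,200. Nothing in the formula prevents the result from going negative, which is exactly the state the ACH bounce scenario below produces.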
T+1 Settlement
Sell AAPL on Monday and the cash is "unsettled" until Tuesday. Unsettled cash CAN buy stocks but CANNOT be withdrawn. Buying with unsettled funds and then selling that position before the funds settle is a Good Faith Violation (GFV). Three GFVs in 12 months restrict the account to settled-cash trading for 90 days.
The ACH Bounce Nightmare
Worst-case edge flow
Day 1: User deposits $5,000 via ACH, gets instant credit, buys $4,800 of TSLA. Day 3: ACH bounces. System must reverse $5,000 credit → user has negative buying power. Account frozen, user has 5 days to deposit or positions are force-liquidated.
Event-Driven Materialized View
Instead of recomputing from scratch, maintain a running snapshot updated incrementally on every relevant event (order created, filled, cancelled, ACH status change, settlement, price tick for margin). Kafka Streams consumer with RocksDB state store writes to Redis for the fast-path check. Order Service does pessimistic revalidation in PostgreSQL inside the transaction.
// Fast path (optimistic, no DB lock)
bp = redis.get("bp:" + userId)
if (bp < orderAmount) → reject immediately (no DB hit needed)
// Slow path (inside transaction, authoritative)
BEGIN;
SELECT buying_power FROM accounts WHERE id = $1 FOR UPDATE;
-- revalidate against actual data, then proceed
COMMIT;
The actual scalability bottleneck. 100K-500K ticks/sec inbound from exchanges, fan out to 2M concurrent subscribers with different watchlists.
Inbound: Exchange Feed Processing
CTA (NYSE-listed) and UTP (NASDAQ-listed) feeds deliver binary market data. The ingestion service parses at < 5μs per message, maintains an in-memory NBBO book per symbol, and only publishes downstream when the NBBO actually changes — cutting downstream load by ~80%.
// NBBO change filter — critical optimization
On each quote update from exchange:
Update bid/ask from reporting exchange
Recompute best_bid = max(all_exchange_bids)
Recompute best_ask = min(all_exchange_asks)
If NBBO changed → PUBLISH market:{symbol} to Redis
If NBBO unchanged → skip (saves 80% of publishes)
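The filter above fits in a few lines. The sketch below keeps one book per symbol and returns the new NBBO only when it changes. The class name `NbboBook` and the use of plain floats are assumptions for illustration, and a production parser would work on fixed-point integers.

```python
from typing import Optional, Tuple


class NbboBook:
    """Per-symbol book: best bid/ask per exchange plus the current NBBO."""

    def __init__(self) -> None:
        self.bids = {}  # exchange -> best bid at that exchange
        self.asks = {}  # exchange -> best ask at that exchange
        self.nbbo: Optional[Tuple[float, float]] = None

    def on_quote(self, exchange: str, bid: float, ask: float):
        """Apply one quote update; return the new NBBO, or None if unchanged."""
        self.bids[exchange] = bid
        self.asks[exchange] = ask
        candidate = (max(self.bids.values()), min(self.asks.values()))
        if candidate == self.nbbo:
            return None  # nothing to publish downstream
        self.nbbo = candidate
        return candidate


book = NbboBook()
first = book.on_quote("XNYS", 187.40, 187.42)   # establishes the NBBO
second = book.on_quote("XNAS", 187.39, 187.43)  # worse on both sides
third = book.on_quote("XNAS", 187.41, 187.43)   # improves the best bid
```

Only `first` and `third` would trigger a publish to `market:{symbol}`. The second update changes one exchange's quote but not the national best, so it is dropped.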
Fan-Out: Redis Pub/Sub → WebSocket Servers
Each WebSocket server subscribes to Redis channels only for symbols its connected users watch. When AAPL ticks, Redis sends one message to each subscribing server, which then multicasts to all local connections watching AAPL.
✓ Chosen
Redis Pub/Sub
Fire-and-forget. <1ms latency. If subscriber is slow, messages drop — acceptable for ticks. Latest value always wins.
✗ Alternative
Kafka for Market Data
Persistent, guaranteed delivery. But 5-20ms latency, and stale ticks queuing up is worse than dropping them.
WebSocket Server Internal Fan-Out
Hot path: Redis delivers one AAPL message → server looks up HashMap<Symbol, Set<ConnectionId>> → maybe 15,000 connections need it → serialize once → write binary frame to 15,000 TCP buffers. Server-side batching at 50ms windows deduplicates multiple ticks for the same symbol and uses writev() gathering writes to reduce syscalls.
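A simplified model of the hot path: a symbol-to-connections index plus a per-window buffer where the latest tick for each symbol overwrites earlier ones. The class and method names are hypothetical, and the 50ms timer and the actual socket writes (including `writev()`) are left to the caller.

```python
from collections import defaultdict


class SymbolFanOut:
    def __init__(self) -> None:
        self.subscribers = defaultdict(set)  # symbol -> connection ids
        self.pending = {}                    # symbol -> latest payload this window

    def subscribe(self, conn_id: str, symbol: str) -> None:
        self.subscribers[symbol].add(conn_id)

    def on_tick(self, symbol: str, payload: bytes) -> None:
        # Latest value wins: older ticks in the same window are overwritten.
        self.pending[symbol] = payload

    def flush(self) -> list:
        """Called once per batching window; returns (conn_id, payload) writes."""
        writes = []
        for symbol, payload in self.pending.items():
            # The payload was serialized once; every subscriber shares it.
            for conn_id in sorted(self.subscribers.get(symbol, ())):
                writes.append((conn_id, payload))
        self.pending = {}
        return writes


fan_out = SymbolFanOut()
fan_out.subscribe("c1", "AAPL")
fan_out.subscribe("c2", "AAPL")
for price in (b"187.40", b"187.41", b"187.42"):
    fan_out.on_tick("AAPL", price)
writes = fan_out.flush()
```

Three ticks inside one window collapse into a single write per connection, which is where the syscall savings come from.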
Per-Server Numbers
50K connections, top 10 symbols at 15K subscribers each × 5 ticks/sec = 750K writes/sec. Mid-tier 1000 symbols at 500 subscribers × 2 ticks/sec = 1M writes/sec. Total: ~1.75M write operations/sec per server. At roughly 100 bytes per frame that is ~175 MB/sec (~1.4 Gbps) outbound, so a 2 Gbps NIC leaves about 30% headroom.
When a user submits "BUY 100 AAPL MARKET," the order could go to 13 US exchanges, multiple market makers (Citadel, Virtu), or dark pools. The Smart Order Router (SOR) selects the best venue.
Reg NMS — The Legal Constraint
Rule 611 (Order Protection Rule): you cannot execute a trade at a price inferior to the National Best Bid/Offer (NBBO). If NYSE offers AAPL at $187.42 and you route a buy order to Citadel, who fills it at $187.45, you have traded through the best offer and violated Reg NMS. The SOR must compare all venue quotes against the NBBO before routing.
SOR Decision Engine
Scoring per venue (weighted):
Price improvement (40%): filling better than the NBBO (a buy below the best offer, a sell above the best bid) saves the customer money
Historical fill rate (25%): what % of orders at this venue actually fill?
Speed (15%): how fast does the venue respond?
PFOF rate (10%): payment for order flow revenue
Stability (10%): reject rate penalty
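The weighted scoring can be sketched as below for a buy order. Two assumptions are made here that the text does not specify: each factor has already been normalized to the range 0 to 1 (higher is better), and any venue quoting worse than the national best offer is removed before scoring so the Reg NMS constraint is never traded off against the other factors. `VenueStats`, `score`, and `pick_venue` are hypothetical names.

```python
from dataclasses import dataclass

WEIGHTS = {
    "price_improvement": 0.40,
    "fill_rate": 0.25,
    "speed": 0.15,
    "pfof": 0.10,
    "stability": 0.10,
}


@dataclass(frozen=True)
class VenueStats:
    name: str
    quote: float              # price at which the venue would fill a buy
    price_improvement: float  # all factors normalized to 0..1 (assumption)
    fill_rate: float
    speed: float
    pfof: float
    stability: float


def score(venue: VenueStats) -> float:
    return sum(weight * getattr(venue, factor)
               for factor, weight in WEIGHTS.items())


def pick_venue(venues, national_best_offer: float):
    """Best-scoring venue among those that do not trade through the NBBO."""
    eligible = [v for v in venues if v.quote <= national_best_offer]
    return max(eligible, key=score, default=None)


venues = [
    VenueStats("MM_A", 187.45, 1.0, 1.0, 1.0, 1.0, 1.0),  # worse than NBBO
    VenueStats("MM_B", 187.42, 0.2, 0.9, 0.8, 0.5, 0.9),
    VenueStats("XNYS", 187.42, 0.0, 0.7, 0.9, 0.0, 1.0),
]
chosen = pick_venue(venues, national_best_offer=187.42)
```

`MM_A` has a perfect score but is filtered out because its quote is inferior to the best offer. The choice is then between the two compliant venues, and `MM_B` wins on the weighted sum.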
PFOF Economics — How Robinhood Makes Money
Revenue model
Market makers pay $0.002-$0.004/share for retail order flow. 4M orders/day × 20 shares avg × $0.003 = $240K/day from equities. Options PFOF at $0.40-$0.60/contract is 2-3× more profitable. Total PFOF revenue: ~$150-200M/year. Market makers profit because retail flow is "uninformed" — less likely to be followed by adverse price moves.
The safety net that catches every edge case the real-time systems miss. Runs daily after market close.
End-of-Day Reconciliation (5:00 PM ET)
Step 1: Pull execution reports from all brokers (FIX drop copy / API)
Step 2: Pull internal order_events for the day
Step 3: Match by ClOrdID
Step 4: Compare each pair — qty, price, venue, status
Step 5: Classify discrepancies:
Cat A — Missing internally (most critical): reconstruct from exchange data
Cat B — Missing at exchange (suspicious): quarantine, investigate
Cat C — Price/qty mismatch: auto-correct if < $0.01, else manual review
Cat D — Status mismatch: update our status, release held reserves
Step 6: Report — typically 99.999% perfect match rate
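Steps 3 to 5 amount to a keyed join followed by a classification. The sketch below assumes each side has been loaded into a dict keyed by ClOrdID with `qty`, `price`, and `status` fields, and it interprets the auto-correct rule as applying only to price-only differences under $0.01. Both are assumptions, as is the function name `classify`.

```python
from decimal import Decimal

AUTO_CORRECT_LIMIT = Decimal("0.01")


def classify(internal: dict, external: dict) -> dict:
    """Map ClOrdID -> (category, action) for every discrepancy found."""
    findings = {}
    for cl_ord_id in sorted(internal.keys() | external.keys()):
        ours = internal.get(cl_ord_id)
        theirs = external.get(cl_ord_id)
        if ours is None:
            findings[cl_ord_id] = ("A", "reconstruct from exchange data")
        elif theirs is None:
            findings[cl_ord_id] = ("B", "quarantine and investigate")
        elif ours["qty"] != theirs["qty"] or ours["price"] != theirs["price"]:
            tiny_price_diff = (
                ours["qty"] == theirs["qty"]
                and abs(ours["price"] - theirs["price"]) < AUTO_CORRECT_LIMIT
            )
            action = "auto-correct" if tiny_price_diff else "manual review"
            findings[cl_ord_id] = ("C", action)
        elif ours["status"] != theirs["status"]:
            findings[cl_ord_id] = ("D", "update status, release reserves")
        # Exact matches produce no finding.
    return findings


def rec(qty, price, status):
    return {"qty": qty, "price": Decimal(price), "status": status}


internal = {
    "ok1": rec(10, "187.42", "FILLED"),
    "b1": rec(5, "50.00", "FILLED"),
    "c1": rec(10, "187.420", "FILLED"),
    "c2": rec(10, "187.42", "FILLED"),
    "d1": rec(10, "187.42", "SUBMITTED"),
}
external = {
    "ok1": rec(10, "187.42", "FILLED"),
    "a1": rec(7, "12.00", "FILLED"),
    "c1": rec(10, "187.425", "FILLED"),
    "c2": rec(8, "187.42", "FILLED"),
    "d1": rec(10, "187.42", "FILLED"),
}
findings = classify(internal, external)
```

The match rate in Step 6 is then one minus `len(findings)` divided by the number of distinct ClOrdIDs seen on either side.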
Corporate Actions
Stock splits, reverse splits, dividends, mergers, and spinoffs change positions overnight without user action. A corporate actions processor runs nightly: cancels pending orders for affected symbols, adjusts position quantities and cost basis, creates audit events, and notifies users. Stock splits also require retroactive adjustment of historical price data and portfolio snapshots.
Memory Per Connection
TCP socket buffers: ~32 KB (tuned down from 128KB default)
TLS session state: ~4 KB
WebSocket frame buffer: ~8 KB
Application state: ~3 KB (metadata + subscriptions + ring buffer)
─────────────────────────────────
Total: ~47 KB per connection (tuned) to ~160 KB (default)
Per server (50K conns): ~2.4 GB (tuned) to ~8 GB (default) of RAM for connections alone
L4 vs L7 Load Balancing
✓ Chosen
L4 (NLB / HAProxy TCP)
Forwards TCP packets directly. Single connection client→server. LB tracks only {src:port → dst:port} mapping (~100 bytes each). 2M × 100 bytes = 200 MB — trivial.
✗ Alternative
L7 (ALB / Nginx)
Terminates WebSocket, creates new connection to backend. Doubles connection count (LB holds 2M + servers hold 2M). 320 GB RAM on LB alone. LB becomes throughput bottleneck.
Reconnection Storm Mitigation
When WS-17 crashes, 50K clients reconnect simultaneously. Without mitigation: thundering herd. Solution: client jitter (random 0-10s delay), server-side rate limiting (max 500 new connections/sec), LB health checks (overloaded servers marked unhealthy). Graceful deploys send {"action":"reconnect","delay_ms":random(0,30000)} before draining.
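Two of the mitigations above are easy to sketch: the client-side jitter and the server-side connection rate limit. The token bucket below is one common way to implement a "max 500 new connections/sec" rule and is an assumption, since the text does not say which algorithm is used. The clock is passed in explicitly so the behaviour is deterministic.

```python
import random


def client_reconnect_delay(rng: random.Random, max_jitter_s: float = 10.0) -> float:
    """Random 0-10s delay so 50K clients do not reconnect in the same instant."""
    return rng.uniform(0.0, max_jitter_s)


class TokenBucket:
    """Admit at most rate_per_sec new connections per second, on average."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller rejects or defers the handshake


limiter = TokenBucket(rate_per_sec=500, burst=500)
accepted_at_t0 = sum(limiter.allow(now=0.0) for _ in range(600))
accepted_at_half_second = sum(limiter.allow(now=0.5) for _ in range(600))
```

In this run 500 of the first 600 attempts are admitted immediately, and half a second later the bucket has refilled enough for 250 more. Rejected clients fall back to their jittered retry.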
Order Event Routing to Correct Server
Connection registry pattern: on connect, HSET ws:connections {user_id} {server_id} in Redis. Order event consumer does one Redis lookup per fill to route to the correct WebSocket server via internal channel.
Pre-Trade Risk Pipeline (7 checks, < 20ms)
| # | Check | Failure |
| 1 | Buying power sufficient | INSUFFICIENT_BUYING_POWER |
| 2 | Position limits ($500K max per symbol) | POSITION_LIMIT_EXCEEDED |
| 3 | Order rate limits (60/min, $1M/min notional) | RATE_LIMIT_EXCEEDED |
| 4 | Price reasonability (limit within 10% of market) | PRICE_UNREASONABLE |
| 5 | Symbol tradability (not halted, session valid) | SYMBOL_HALTED |
| 6 | PDT rule (day trades remaining > 0) | PDT_RESTRICTION |
| 7 | Good Faith Violation prevention | GFV_WARNING |
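The table reads naturally as an ordered list of predicates where the first failure short-circuits the rest. In the sketch below, `RiskContext` and its fields are hypothetical inputs assumed to be precomputed by upstream services, and checks 6 and 7 are simplified to a single flag or counter each.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RiskContext:
    order_notional: float
    buying_power: float
    position_notional_after: float  # position value if the order fills
    orders_last_minute: int
    notional_last_minute: float
    limit_price: float
    market_price: float
    symbol_halted: bool
    day_trades_remaining: int
    would_cause_gfv: bool


# (failure code, predicate that must hold) in the order given by the table.
CHECKS = [
    ("INSUFFICIENT_BUYING_POWER", lambda c: c.order_notional <= c.buying_power),
    ("POSITION_LIMIT_EXCEEDED", lambda c: c.position_notional_after <= 500_000),
    ("RATE_LIMIT_EXCEEDED",
     lambda c: c.orders_last_minute < 60
     and c.notional_last_minute + c.order_notional <= 1_000_000),
    ("PRICE_UNREASONABLE",
     lambda c: abs(c.limit_price - c.market_price) <= 0.10 * c.market_price),
    ("SYMBOL_HALTED", lambda c: not c.symbol_halted),
    ("PDT_RESTRICTION", lambda c: c.day_trades_remaining > 0),
    ("GFV_WARNING", lambda c: not c.would_cause_gfv),
]


def pre_trade_check(ctx: RiskContext):
    """Return the first failure code, or None if every check passes."""
    for code, passes in CHECKS:
        if not passes(ctx):
            return code
    return None


ok = RiskContext(
    order_notional=1_874.20, buying_power=5_000.0,
    position_notional_after=1_874.20, orders_last_minute=1,
    notional_last_minute=0.0, limit_price=187.42, market_price=187.40,
    symbol_halted=False, day_trades_remaining=3, would_cause_gfv=False,
)
```

Ordering matters for latency as well as semantics: putting the buying-power check first rejects the most common failure before the rest of the pipeline runs.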
Real-Time Margin Monitoring
Continuously compute margin_ratio = equity / maintenance_requirement for every margin account. Thresholds: >2.0 healthy, 1.5-2.0 warning (push notification), 1.0-1.5 margin call (restrict buying, 2-5 day grace), 0.75-1.0 forced liquidation (auto-sell most liquid positions first), <0.75 emergency liquidation (sell everything). Watch-list optimization: only accounts near a threshold get per-tick recalculation; healthy accounts are batch-checked every 30 seconds.
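The thresholds map to a simple tier function. This sketch reads the two liquidation thresholds as nested bands (0.75 up to 1.0 is forced liquidation, below 0.75 is emergency), places each exact boundary value in the safer tier, and treats an account with no maintenance requirement as healthy. All three choices are assumptions, as is the function name.

```python
def margin_tier(equity: float, maintenance_requirement: float) -> str:
    """Classify a margin account by equity / maintenance_requirement."""
    if maintenance_requirement <= 0:
        return "HEALTHY"  # assumption: no margin in use, nothing to maintain
    ratio = equity / maintenance_requirement
    if ratio > 2.0:
        return "HEALTHY"
    if ratio >= 1.5:
        return "WARNING"                # push notification
    if ratio >= 1.0:
        return "MARGIN_CALL"            # restrict buying, grace period
    if ratio >= 0.75:
        return "FORCED_LIQUIDATION"     # sell most liquid positions first
    return "EMERGENCY_LIQUIDATION"      # sell everything


tiers = [margin_tier(equity, 10_000)
         for equity in (25_000, 18_000, 12_000, 9_000, 7_000)]
```

The watch-list optimization then reduces to recomputing per tick only for accounts whose last tier was anything other than `HEALTHY`.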
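Exchange Circuit Breakers (LULD and Market-Wide)
Single-stock (LULD): a stock that moves more than about 5% from its trailing 5-minute reference price triggers a 5-minute trading pause; the band is wider for lower-priced and less liquid stocks. Market-wide, keyed to S&P 500 declines from the prior close: -7% = 15-min halt, -13% = 15-min halt, -20% = market closes for the day. The system must detect halt messages, stop accepting orders for halted symbols, and handle the flood when trading resumes.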
The portfolio value at any point = cash_at_t + Σ(shares_held_at_t × price_at_t). For 25M users across years of history, this is a massive data problem.
Intraday (1-Day View)
Snapshot + delta + live prices. Start-of-day snapshot (precomputed at market open) + position changes from trades today + current prices = equity curve. 78 data points (5-min intervals across 6.5 hrs). Computed on-demand per request in ~5ms — no need to precompute for all users.
Historical (1-Week to All-Time)
Nightly batch job at 1:00 AM: for each user with positions, compute close-of-day equity = cash + Σ(qty × closing_price). Store in portfolio_snapshots table (user_id, date, equity). 25M users × 252 days × 100 bytes = 630 GB/year. Partition by date, archive >2 years to Parquet on S3.
Corporate Action Adjustments
Stock splits make historical snapshots wrong. When AAPL splits 4:1, historical price data is retroactively adjusted (divided by 4). Affected portfolio snapshots must be recomputed using split-adjusted prices — a batch job triggered by the corporate actions processor.
Six-Layer Idempotency Map
| Layer | Mechanism | Protects Against |
| 1. Mobile Client | UUID idempotency key, local storage | User taps "buy" twice |
| 2. API Gateway | Redis SET NX + cached response | Network retry duplicates |
| 3. Order Service | PostgreSQL UNIQUE constraint | Redis miss, concurrent requests |
| 4. Kafka Producer | enable.idempotence=true, PID+seq | Producer retry sends same message twice |
| 5. Execution Engine | RocksDB processed_order_ids set | Consumer restart reprocesses message |
| 6. Exchange (FIX) | Unique ClOrdID per session/day | FIX session retry sends duplicate order |
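Layer 2 of the table (check the key, return the cached response on a repeat) can be modelled in memory as follows. A plain dict stands in for Redis here, so this sketch is single-threaded and has no TTL. The real gateway would rely on the atomicity of `SET NX`, with the PostgreSQL UNIQUE constraint as the backstop for races. `IdempotentGateway` is a hypothetical name.

```python
class IdempotentGateway:
    def __init__(self, handler) -> None:
        self._handler = handler  # the real order-creation call
        self._responses = {}     # idempotency key -> cached response

    def submit(self, idempotency_key: str, request: dict) -> dict:
        # A retry with a known key gets the original response back
        # and never reaches the handler a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        response = self._handler(request)
        self._responses[idempotency_key] = response
        return response


created = []


def create_order(request: dict) -> dict:
    created.append(request)
    return {"order_id": f"ord_{len(created)}", "status": "CREATED"}


gateway = IdempotentGateway(create_order)
first = gateway.submit("key-123", {"symbol": "AAPL", "qty": 10})
retry = gateway.submit("key-123", {"symbol": "AAPL", "qty": 10})
other = gateway.submit("key-456", {"symbol": "AAPL", "qty": 10})
```

The retry returns the same `order_id` as the first call and only two orders are ever created, even though three requests arrived.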
Transactional Outbox Pattern
BEGIN;
INSERT INTO orders (...) VALUES (...);
INSERT INTO order_events (...) VALUES (...);
INSERT INTO outbox (topic, key, payload) VALUES ('orders.submitted', user_id, order_json);
COMMIT;
-- Outbox relay polls, publishes to Kafka, marks published=true
-- Alternative: Debezium CDC reads WAL, auto-publishes (no outbox table)
Poison Pill Handling
Unprocessable messages in Kafka (malformed, references deleted user) → Dead Letter Queue. Transient errors (DB timeout): don't commit offset, retry with backoff. Non-retriable errors: send to DLQ, commit offset to unblock partition, alert ops team.
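The two error paths above differ only in what happens to the offset. A minimal sketch, assuming the handler signals the difference by raising one of two hypothetical exception types, and with alerting omitted:

```python
import time


class TransientError(Exception):
    """Retriable failure, for example a DB timeout."""


class PermanentError(Exception):
    """Non-retriable failure, for example a malformed payload."""


def process(message, handler, dlq: list, max_attempts: int = 3,
            sleep=time.sleep) -> str:
    """Return 'commit' if the offset may advance, else 'no_commit'."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return "commit"
        except TransientError:
            # Exponential backoff, capped; the offset is NOT committed yet.
            sleep(min(0.1 * 2 ** attempt, 5.0))
        except PermanentError:
            # Park the poison pill and unblock the partition.
            dlq.append(message)
            return "commit"
    return "no_commit"  # still failing: leave the offset so it is redelivered
```

Committing after a DLQ write is what keeps one bad message from stalling every order behind it on the same partition.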
Exactly-once argument: no single failure can cause a duplicate trade, and no single failure can lose a trade that was accepted. Each layer's dedup mechanism is independent, so if one fails, the next catches it. Belt, suspenders, and a parachute.
SEC Rule 17a-4
Every broker-dealer must preserve order records, execution records, customer communications, and account records for up to 6 years (3 years for some record types, such as order tickets and communications), with the first 2 years easily accessible. Format: WORM (Write Once, Read Many), so records cannot be altered or deleted. Must be producible within 24 hours of SEC/FINRA request.
Compliance Event Bus
Every service → Kafka audit.* topics → Consumers:
├─ S3 (WORM, Object Lock COMPLIANCE mode, 7-year retention, Parquet)
├─ Elasticsearch (2-year hot store, searchable within seconds)
├─ Compliance Dashboard (real-time monitoring)
└─ Third-Party Archive (FINRA-approved vendor)
Daily volume: 4M orders × 5 events × 1KB = 20 GB/day raw → ~4 GB compressed
Real-Time Surveillance
Apache Flink streaming application consumes from audit.orders, maintains sliding windows per user per symbol, and detects:
Wash Trading
Buy then sell the same stock at nearly the same price within minutes to create false volume. Flag when one account has opposite-side trades in the same symbol within ±0.5% of each other in price and within 5 minutes.
Spoofing / Layering
Place large limit orders to move price, cancel before fill. Flag if >1000 shares cancelled within 30 seconds, 3+ times/day.
Front-Running
Employee trades same symbol within 15 minutes before large customer order. All employee accounts tagged for monitoring.
Pump and Dump
Social media spike + buy activity by small group + subsequent sell after price increase. Requires external social media feed integration.
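Of the four patterns, wash trading is the simplest to express as a sliding-window rule. The sketch below is plain Python rather than Flink and keeps a per-user, per-symbol window in memory. It assumes the rule means opposite-side trades by the same account within ±0.5% in price and 5 minutes. The class name and method signature are hypothetical.

```python
from collections import deque


class WashTradeDetector:
    def __init__(self, window_s: float = 300.0, price_band: float = 0.005) -> None:
        self.window_s = window_s
        self.price_band = price_band
        self.recent = {}  # (user_id, symbol) -> deque of (ts, side, price)

    def on_trade(self, user_id: str, symbol: str, side: str,
                 price: float, ts: float) -> bool:
        """Record a trade; return True if it matches the wash-trade rule."""
        window = self.recent.setdefault((user_id, symbol), deque())
        # Evict trades that have slid out of the 5-minute window.
        while window and ts - window[0][0] > self.window_s:
            window.popleft()
        flagged = any(
            prev_side != side
            and abs(price - prev_price) <= self.price_band * prev_price
            for _, prev_side, prev_price in window
        )
        window.append((ts, side, price))
        return flagged


detector = WashTradeDetector()
buy = detector.on_trade("u1", "XYZ", "BUY", 100.00, ts=0.0)
quick_sell = detector.on_trade("u1", "XYZ", "SELL", 100.20, ts=120.0)
late_sell = detector.on_trade("u2", "XYZ", "SELL", 100.00, ts=400.0)
```

The sell two minutes after the buy is flagged because it is on the opposite side and within 0.5% of the buy price. A flag here is a signal for compliance review, not proof of intent.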
Required Filings
Rule 606 (quarterly): where orders were routed and what PFOF was received. CAT (daily): every order lifecycle event, reported to the Consolidated Audit Trail (operated by a FINRA subsidiary). SAR (as needed): suspicious activity reported to FinCEN within 30 days of detection.