High-Throughput Stock Exchange

Design a NASDAQ-scale order-matching system that processes millions of orders per second with single-digit microsecond latency, maintains deterministic fairness guarantees, and broadcasts every state change to thousands of participants in real-time.

Matching EngineMicrosecond LatencyKernel BypassLMAX DisruptorUDP MulticastFIX ProtocolDeterministicCache-Line Engineering
01

Problem Statement

Design a centralized order-matching system — the core of a stock exchange like NASDAQ or NYSE. The exchange receives buy and sell orders from thousands of brokers, matches them according to price-time priority rules, and broadcasts results to all market participants.

This is fundamentally a systems programming problem, not a distributed systems problem. The matching engine itself runs on a single thread per symbol to eliminate all synchronization overhead. The distributed aspects come from gateway servers, replication, and market data broadcasting.

Core question: How do you process millions of orders per second with single-digit microsecond latency, while maintaining deterministic fairness guarantees and zero data loss?

Three planes with completely different design philosophies: the Order Gateway is distributed and stateless, the Matching Engine is single-threaded and bare-metal optimized, and the Market Data Plane uses hardware-level multicast for simultaneous delivery to all subscribers.

How is this different from Robinhood?

Robinhood (broker) accepts orders from users, validates buying power, and routes to exchanges. The exchange receives orders from thousands of brokers simultaneously, maintains an order book per symbol, runs a matching engine that pairs buyers with sellers, and broadcasts trades/quotes to the entire market. Robinhood's hard problem was consistency of money movement. The exchange's hard problem is throughput and latency at extreme scale with deterministic fairness.

02

Requirements

Functional Requirements

Order Management

  • Accept orders via FIX protocol from member broker-dealers — market, limit, stop, stop-limit, IOC, FOK, GTC, day orders
  • Cancel and modify (cancel-replace) resting orders with O(1) lookup
  • Match orders using price-time priority (FIFO) — best price first, earliest order first within same price
  • Support auction mechanisms — opening cross, closing cross, halt auctions
  • Manage trading sessions — pre-market, continuous, auction, after-hours

Market Data

  • Publish trade reports and quote updates via UDP multicast to all subscribers simultaneously
  • Level 1 (top of book) feed to SIP for NBBO computation
  • Level 2 (depth of book) proprietary feed using ITCH protocol
  • A/B feed redundancy on independent network paths
  • Retransmit server (TCP) for gap recovery; snapshot service for full-book recovery

Regulatory

  • LULD circuit breakers — halt trading if price moves >5% in 5 minutes
  • Reg NMS Rule 611 — trade-through prevention against NBBO
  • Self-trade prevention — detect and prevent wash trades within same broker
  • Maker-taker fee tagging on every fill
  • CAT (Consolidated Audit Trail) reporting with nanosecond timestamps
  • Deterministic replay for regulatory verification

Non-Functional Requirements

Latency

  • Order-to-ack: < 10 μs median
  • Order-to-match: < 15 μs median
  • p99: < 50 μs
  • p99.9: < 100 μs
  • Jitter: < 5 μs standard deviation

Throughput & Availability

  • Sustained: 1M orders/sec, peak 3-5M
  • Per-symbol hot: 100K orders/sec (SPY/AAPL)
  • 99.999% uptime during market hours (~59 sec downtime/year)
  • Recovery: < 5 seconds after crash via standby failover
  • Zero data loss for acknowledged events

Technology elimination: At 10μs total budget, no databases, no Redis, no Kafka, no HTTP, no JSON, no GC in the hot path. Everything in the matching engine must be in-memory, pre-allocated, lock-free, and zero-copy.

03

Scale Estimation

Anchored on NASDAQ-scale: 15-20% of total US equity volume. 2.5B orders/day, but only 7M trades (OTR ~150:1). The order book is overwhelmingly an add-and-cancel machine — cancels (52%) are more frequent than adds (40%).

1M/sec
Sustained Orders
3-5M/sec
Peak Orders
100K/sec
Per-Symbol Hot
10 μs
Time Budget/Order
~16 GB
Total Book Memory
1-5M/sec
Market Data Out
~2,000
FIX Sessions
144 MB/sec
WAL Write Rate

Per-Symbol Hot Path Budget

SPY at 100K orders/sec → 10 μs per order → ~35,000 CPU cycles at 3.5 GHz. A single L3 cache miss costs ~150 cycles. This means every data access must hit L1/L2 cache. The order book data structure design is driven entirely by cache-line optimization.

Symbol Partitioning Strategy

Static core assignment: top 10 symbols get dedicated cores. Next 90 symbols share 10 cores (~9 each). Next 900 share 10 cores. Remaining 7,000 share 10 cores. Each core runs one matching thread — no context switching, no locks, no shared state between cores.

04

API Design

No REST. No JSON. No HTTP. Everything is binary messages over persistent TCP connections using the FIX protocol externally, and custom 64-byte structs internally. HTTP overhead alone (~80-2,330μs) exceeds our entire 10μs budget.

Inbound: FIX Protocol (Broker → Exchange)

NewOrderSingle (MsgType=D):
  cl_ord_id, symbol, side(Buy/Sell), quantity, ord_type(Market/Limit),
  price (integer, hundredths of cents), time_in_force(Day/GTC/IOC/FOK)

OrderCancelRequest (MsgType=F):
  cl_ord_id, orig_cl_ord_id, symbol, side

CancelReplaceRequest (MsgType=G):
  cl_ord_id, orig_cl_ord_id, symbol, side, new_qty, new_price
  → Price change loses time priority; qty-down-only keeps priority

Internal: 64-byte Cache-Line-Aligned Struct

struct InternalOrder {   // exactly 1 cache line = 64 bytes
  uint8_t  msg_type;     // NEW, CANCEL, REPLACE
  uint8_t  side;         // BUY=1, SELL=2
  uint8_t  ord_type;     // MARKET=1, LIMIT=2, STOP=3
  uint8_t  tif;          // DAY=0, GTC=1, IOC=3, FOK=4
  uint32_t symbol_id;    // integer ID (not string — gateway maps at logon)
  uint64_t order_id;     // exchange-assigned
  uint64_t cl_ord_id;    // broker's ID (hashed to uint64)
  uint32_t broker_id;    // integer broker ID
  uint32_t quantity;
  int64_t  price;        // hundredths of cents ($187.45 = 18745000)
  uint64_t timestamp;    // nanoseconds (NIC hardware PTP clock)
  uint64_t orig_order_id;// for cancel/replace
} __attribute__((packed, aligned(64)));

Outbound: ExecutionReport (Exchange → Broker)

ExecutionReport (MsgType=8):
  exec_type: NEW | PARTIAL_FILL | FILLED | CANCELLED | REJECTED
  last_qty, last_px, leaves_qty, cum_qty, avg_px
  liquidity_indicator: 'A'=Added(maker) | 'R'=Removed(taker)
  contra_broker: other side's broker ID (for clearing)

Market Data: ITCH + MoldUDP64

ITCH message types (subscribers reconstruct full order book):
  ADD_ORDER (36 bytes): order_ref, side, shares, symbol, price
  ORDER_EXECUTED (31 bytes): order_ref, shares, match_number
  ORDER_CANCELLED (23 bytes): order_ref, cancelled_shares
  ORDER_REPLACE (35 bytes): old_ref, new_ref, shares, price
  TRADE (44 bytes): non-displayable execution

MoldUDP64 framing: session_id(10) + sequence(8) + msg_count(2)
  Multiple ITCH messages batched per UDP packet (up to MTU 1500)
  A/B feed redundancy on independent switch paths
05

High-Level Architecture

Three planes with completely different scaling models: Gateway (stateless, horizontal), Matching Engine (stateful, vertical + partitioning), Market Data (stateless formatting, hardware multicast).

GATEWAY PLANE — LEFT · MATCHING ENGINE — CENTER · MARKET DATA — RIGHT BrokersFIX Sessions GatewaysParse · Validate SequencerHW Timestamp Sort Matching EngineSingle-Thread/Symbol DisruptorLock-Free Ring Buffer WAL Writermmap · NVMe ReplicationRDMA · 1.5μs GW ResponseFIX ExecReport MD FormatterITCH · SIP UDP MulticastA/B Feeds SubscribersHFT · Vendors Hot StandbyIdentical Books FIX/TCPNormalizeSortedEventsTier 1ConsumerMulticastRDMABarrier

Infrastructure

ComponentCountKey Spec
Gateway Servers4-816-32 cores, 25GbE DPDK, Solarflare Onload
Matching Engine Primary140-64 cores, 512GB RAM, 40GbE RDMA, NVMe
Matching Engine Standby1Identical to primary, RDMA-connected
Market Data Servers4-616-32 cores, 40GbE multicast-capable
Retransmit / Snapshot2-4256GB RAM (message buffer)
Network Switches6-8Arista 7130, cut-through, <500ns latency
PTP Grandmaster2GPS-synced, ±30ns to UTC
Request Flow — Step Through
Broker (co-lo)Gateway NICFIX ParseValidateSequencerMatching EngineOrder Book MatchDisruptor PublishWAL + ReplicationGateway ResponseBroker Receives
Click Next Step to walk through the request flow.
06

Deep Dives — 15 Rabbit Holes

Each rabbit hole explores a critical subsystem at implementation depth — from cache-line-level data structure engineering to regulatory fairness requirements.

01

Order Book Data Structure

The single most performance-critical data structure. Uses a hybrid approach: flat array indexed by price for the hot zone (O(1) lookup, 512KB fits in L2 cache) plus Red-Black Tree for far-from-market orders.

Operation Frequency Drives Optimization

OperationFrequencyTargetComplexity
Cancel order52% of messages< 0.5 μsO(1) HashMap + O(1) list remove
Add order40%< 1 μsO(1) array index + O(1) list append
Match (trade)0.03%< 3 μsO(k) fills at best price
Read best bid/askevery operation< 0.1 μsO(1) cached pointer

64-Byte Order Struct (Exactly 1 Cache Line)

struct Order {  // hot fields first for prefetcher
  int64_t price;           // 8B — compared during matching
  uint32_t remaining_qty;  // 4B — decremented during fills
  uint32_t broker_id;      // 4B — checked for STP
  Order* next;             // 8B — linked list traversal
  Order* prev;             // 8B — linked list removal
  uint64_t order_id;       // 8B — for execution report
  uint64_t cl_ord_id;      // 8B — for broker response
  uint32_t original_qty;   // 4B — cold
  uint32_t symbol_id;      // 4B — cold
  uint64_t timestamp;      // 8B — cold
} __attribute__((packed, aligned(64))); // = 64 bytes ✓

Custom Open-Addressing HashMap for O(1) Cancel

1M slots × 16 bytes = 16 MB (fits in L3). Fibonacci hashing with linear probing — at 50% load factor, average 1.5 probes per lookup. 5-10 ns per cancel lookup vs 50-200 ns for std::unordered_map. Pre-allocated pool with free-list allocator on huge pages eliminates all malloc/free in the hot path.

Benchmark: Add: ~22 ns, Cancel: ~30 ns, Match: ~140 ns per order. One core handles 33-45M operations/sec. Our requirement is 100K/sec — 300× headroom.

02

Matching Algorithm — Price-Time Priority

The core product. Incoming marketable orders sweep through the opposite side of the book, matching against resting orders in FIFO order at each price level.

sequenceDiagram participant B as Broker participant G as Gateway participant M as Matching Engine participant D as Disruptor B->>G: FIX NewOrderSingle (BUY 100 AAPL LIMIT $187.46) G->>M: 64-byte InternalOrder (validated, normalized) Note over M: Check: $187.46 >= best ask $187.46 → MARKETABLE! Note over M: Match against resting sell at $187.46 (FIFO) Note over M: STP check: different brokers ✓ Note over M: Fill 100 shares @ $187.46 M->>D: TRADE event + ORDER_EXECUTED + QUOTE_UPDATE D-->>G: ExecutionReport (FILLED, liquidity='R' taker) D-->>G: ExecutionReport to seller (liquidity='A' maker)

Order Type Handling

Market Orders

Always match immediately. No price limit — take whatever is available. Sweep through price levels until fully filled or book exhausted. Never rest on the book.

Limit Orders

Match if marketable (buy price ≥ best ask). Remaining quantity rests on book. IOC: cancel unfilled portion. FOK: pre-check full quantity available before executing any fills.

Stop Orders

Sit in separate trigger table. After every trade, check if trade price crosses any stop triggers. Triggered stops convert to market/limit and enter normal matching — can cause cascades.

Cancel-Replace

Atomic cancel + new order. Price change loses time priority. Qty-down-only keeps priority. New price might be marketable — modify can trigger immediate matching.

Deterministic Replay

Same input sequence must produce same output sequence. Every time. No floating-point (integer prices), no system clock reads (timestamps from input), no randomness, no threads. Any trading day can be replayed and verified — regulatory requirement.

03

LMAX Disruptor — Lock-Free Fan-Out

Pre-allocated ring buffer connecting the single-threaded matching engine to 5 downstream consumers. No copying, no locks, no allocation. All consumers read from the same memory with independent cursors.

Ring Buffer: 262,144 slots × 128 bytes = 32 MB (pre-allocated, huge pages)
Producer: Matching engine (single writer, ~1M events/sec)
Consumers (each on own core):
  Tier 1: WAL Writer + Replication Sender (no dependencies)
  Tier 2: Gateway Response + Market Data (gated behind Tier 1)
  Tier 3: Surveillance (gated behind Tier 2)

Cache-Line Padding Eliminates False Sharing

Each cursor is padded to 64 bytes (one cache line). Without padding: producer update invalidates ALL consumer cache lines (~40-70 ns per event). With padding: zero cross-core cache coherence traffic. This single optimization made the Disruptor "impossibly fast."

Dependency Barriers

Gateway cannot send broker ExecutionReport until BOTH WAL and Replication confirm the event is durable. This is the mechanism that achieves zero data loss for acknowledged events without adding latency to the matching engine.

Performance: ~50M messages/sec between threads. 5× faster than any lock-based queue. One event reaches all 5 consumers in ~130 ns total (vs ~500+ ns with standard queues).

04

Kernel Bypass Networking

The Linux kernel TCP/IP stack adds ~10 μs per packet (interrupts, memory allocation, TCP processing, context switches, data copying). Our entire budget is 10 μs. Kernel networking alone consumes 92% of our budget.

Three Levels of Bypass

External: Solarflare Onload

Intercepts standard socket calls, implements TCP in userspace. Application code unchanged. recv() drops from 9 μs to 1.5 μs. Used for broker FIX connections on gateway servers.

Internal: DPDK / ef_vi (raw UDP)

NIC mapped directly into userspace. Packets polled from DMA ring buffer — no interrupts, no kernel. Receive latency: 0.5 μs. Used for gateway → matching engine internal path.

Replication: RDMA

One-sided writes directly into standby's memory without involving standby CPU. Latency: 1.5 μs same-rack. Used for matching engine → standby replication.

Hardware Timestamping (PTP)

NIC PHY stamps every packet at the wire level with ±5 ns accuracy (GPS-synced PTP clock). This provides regulatory-quality nanosecond timestamps and enables the fairness sequencer.

05

WAL Design & Crash Recovery

No database — all state is in memory. Persistence via memory-mapped WAL: append 144-byte entries at ~0.1 μs (memcpy to page cache). OS flushes to NVMe asynchronously. True durability comes from RDMA replication to standby, not local disk.

Recovery Strategy

Cold recovery:
  1. Load latest snapshot (~310 MB, ~0.1 sec from NVMe)
  2. Replay WAL from snapshot timestamp (~30 sec of events)
  3. Replay speed: ~2M events/sec → ~6 seconds total
  4. Resume accepting orders

Hot failover (preferred):
  Standby takes over in ~5-10 ms
  Primary's WAL used only for convenience on restart

Deterministic Replay

Timer events (market open/close, LULD recompute) injected as input messages through the WAL — not from system clock. This ensures replay produces identical order book state. Verified nightly by replaying the full day and comparing every output event.

06

Replication & Failover

Pipelined RDMA replication: matching engine doesn't wait (0 μs cost), but gateway consumer IS gated behind replication (+1.5 μs to broker notification). Zero data loss for acknowledged events.

Failover Sequence (~5-10 ms total)

T=0ms    Primary crashes
T=3ms    Standby detects 3 missed heartbeats (1ms interval)
T=3.5ms  Standby fences primary via IPMI power-off
T=3.6ms  Standby replays pending events from RDMA buffer
T=4ms    Standby promotes to primary (new epoch)
T=5ms    Gateways notified, redirect order flow
T=6ms    Normal operation resumes

Split-Brain Prevention

Three mechanisms: physical fencing (IPMI power-off of old primary), epoch numbers (gateways reject old-epoch messages), 3-node quorum (primary needs 2/3 votes to continue). If fencing fails → failover aborted rather than risking two primaries.

07

Market Data Multicast

Unicast to 1,000 subscribers at 1M msg/sec = 1 billion sends/sec = impossible. UDP multicast: send once, switch replicates to all. 50 MB/sec outbound vs 50 GB/sec for unicast.

Channel Architecture

~50 channels, ~100-200 symbols each. Hot symbols (AAPL, SPY) get smaller groups. Each channel has A/B feed on independent switch paths. MoldUDP64 framing with per-channel monotonic sequence numbers for gap detection.

A/B Feed Redundancy

Two completely independent network paths (different NICs, switches, fiber routes). A alone delivers 99.999% of messages. A OR B delivers 99.9999999% (nine 9s). TCP retransmit server handles the ultra-rare case where both feeds lose the same packet.

Adaptive Batching

Trades: flush immediately (0 delay). Quote updates: flush within 10 μs. Depth updates: batch up to 50 μs or 35 messages. Reduces packet rate by 97% while keeping latency-sensitive messages fast.

08

Fairness Engineering

The defining constraint separating an exchange from any other high-performance system. SEC mandates equal access — latency differences translate directly into money.

Physical: Cable Equalization

Every co-lo path measured by OTDR, padded to equal length (±0.5m = ±2.5 ns). Verified annually by third party.

Temporal: PTP Timestamps

NIC PHY timestamps at ±5 ns accuracy. Two packets 50 ns apart get correctly ordered (30 ns error < 50 ns difference).

Architectural: HW Sequencer

3 μs window collects messages from all gateways, sorts by hardware timestamp. Absorbs all gateway processing variance.

Computational: Determinism

Single-threaded, no float, no clock reads, no randomness. Same input → same output. Regulators can replay any day.

IEX Speed Bump — Alternative Model

350 μs of coiled fiber on all paths. Makes microsecond-level differences irrelevant. 150 ns cable advantage becomes 0.04% of total delay. Our exchange uses the traditional model of minimizing latency for everyone equally.

09

Auction Mechanisms

At market open and close, a completely different algorithm finds a single equilibrium price maximizing matched volume. The closing auction price is the most important price in finance — used by every index fund, ETF, mutual fund, and option for valuation.

Algorithm: Cumulative Supply/Demand Curves

For each possible price P:
  buy_demand(P) = MOO buys + all limit buys at P or higher
  sell_supply(P) = MOC sells + all limit sells at P or lower
  matched(P) = min(buy_demand, sell_supply)

Clearing price = P that maximizes matched(P)
Tie-breakers: minimum imbalance → closest to reference price

All eligible orders execute at the SINGLE clearing price.
NOII (Net Order Imbalance Indicator) published every second
before auction so traders can adjust orders.

NASDAQ's closing cross handles $50-80 billion/day and 10-15% of total US equity volume. Getting the price wrong by $0.01 on AAPL can misvalue ~$168 million in index fund NAV.

10

Mechanical Sympathy — CPU Architecture

Every optimization is rooted in understanding the CPU. L1 access: ~1 ns. L2: ~4 ns. L3: ~14 ns. DRAM: ~80 ns (80× slower than L1). A single DRAM access costs as much time as processing an entire order.

Key Techniques

Cache-Line Alignment

All structs are 64 bytes. No straddling. Predictable access patterns. Hardware prefetcher handles sequential scans.

Huge Pages

2 MB pages eliminate TLB misses. 32 MB order pool = 16 huge pages. All fit in TLB. Regular 4KB pages = 8,192 entries → TLB thrashing.

Core Isolation

isolcpus + nohz_full + IRQ affinity to core 0. Matching cores get zero interrupts, zero timer ticks, zero scheduling.

NUMA Pinning

All order book memory on local NUMA node. Remote access: 140 ns vs 80 ns local = 75% slowdown.

rdtsc for Nanosecond Timestamps

~7 ns to read TSC register (no syscall). clock_gettime: ~20-50 ns (syscall overhead). At 1M timestamps/sec, saves ~30 ms/sec. Used for internal timing; NIC PTP clock for regulatory timestamps.

11

Stop Order & Triggered Order Management

Stop orders don't sit on the visible book — they sit in a separate trigger table indexed by trigger price (same flat-array optimization). After every trade, check if trade price crosses any triggers.

Most trades don't trigger stops: one integer comparison (~2 ns). Only when a trade crosses a trigger does real work happen — triggered stops convert to market/limit orders and enter normal matching, potentially causing cascading triggers (positive feedback loop). LULD circuit breakers are the safety valve.

12

Self-Trade Prevention

Market makers run 20+ strategies simultaneously. STP prevents wash trades (matching against yourself). Checked in the innermost matching loop — one integer comparison per potential match. Modes: cancel resting, cancel incoming, cancel both, cancel smallest. Per-broker configuration, supports sub-broker group-level STP.

Performance impact: branch predictor learns 99%+ of matches are between different brokers → ~0 penalty for the common case.

13

Fee Engine — Maker-Taker Economics

Every fill tagged as maker (resting, earns ~$0.002/share rebate) or taker (incoming, pays ~$0.003/share fee). Exchange net: $0.001/share × 3B shares/day = $3M/day revenue. Volume-tiered pricing: high-volume brokers get better rates. Fee model drives entire market microstructure — brokers route specifically to earn rebates.

14

Capacity Planning & Stress Testing

Exchange cannot fail during extreme events (GME squeeze: 10-50× volume, Flash Crash: 20× message rate). Stress testing: replay historical extreme days at 2-10× speed. Capacity headroom: normal <30%, peak <60%, extreme <90%. Graceful degradation: throttle high-OTR brokers first, reject new orders while processing existing ones, widen LULD bands, never crash or lose data.

15

Multi-Asset Extension

Core matching engine is generic, parameterized by: matching algorithm (FIFO vs pro-rata), tick size, trading hours, circuit breaker rules, fee schedule. Options: 3M order books (10K stocks × 300 series), pro-rata matching. Futures: 24-hour, margin model. Crypto: 24/7/365, no regulation. Cross-asset combo orders: atomic matching across multiple books — architecturally hardest extension.

07

Key Design Decisions & Tradeoffs

1. Single-Threaded vs Multi-Threaded Matching

✓ Chosen

Single-Threaded per Symbol

No locks, no synchronization, no cache coherence traffic. Eliminates ALL concurrency overhead. FIFO ordering guaranteed by construction. Deterministic replay trivially correct.

✗ Alternative

Multi-Threaded per Symbol

Could use more cores per hot symbol. But: lock contention, non-deterministic ordering, cache-line ping-pong. Net throughput LOWER than single-threaded for sequential workloads.

2. Flat Array vs Tree for Price Levels

✓ Chosen

Hybrid: Flat Array + RB-Tree

O(1) price lookup for hot zone (512KB in L2). O(log n) tree for far-from-market orders. 99.9% of operations hit the array. Best of both worlds.

✗ Alternative

Tree Only

O(log n) everything. Simpler. But pointer-chasing defeats cache prefetcher — 5-25× slower for the common case.

3. Pipelined vs Synchronous Replication

✓ Chosen

Pipelined + Gateway Barrier

Matching engine: 0 μs wait. Broker notification: +1.5 μs (gated behind replication). Zero data loss for acknowledged events. Best of both worlds.

✗ Alternative

Fully Synchronous

Matching engine waits for standby ACK (+1.5 μs per event). Safest possible. But adds latency to the matching engine itself, not just to broker notification.

4. Kernel Bypass Hybrid vs Full Bypass

✓ Chosen

Hybrid: Onload External, DPDK Internal

Solarflare Onload for broker TCP (compatible, no code change). Raw DPDK/ef_vi for internal paths (maximum speed). RDMA for replication.

✗ Alternative

Full DPDK Everywhere

Maximum speed but must implement TCP in userspace. Massive engineering effort. Brokers expect standard TCP/FIX.

08

What Can Go Wrong

Split Brain — Both Sides Think They're Primary

Network partition causes standby to promote while primary is alive. Mitigation: IPMI fencing + epoch numbers + 3-node quorum. If fencing fails: abort failover.

Stop Order Cascade

Large price move triggers stop-losses, which trigger more stops, causing a flash crash. Mitigation: LULD circuit breakers halt trading if price moves >5% in 5 minutes.

Market Data Packet Loss

Subscriber's local order book diverges from reality. Mitigation: A/B feed redundancy (nine 9s delivery), TCP retransmit server for gaps, snapshot service for full recovery.

Gateway Processing Variance

Different gateways process same message type at different speeds → unfair advantage. Mitigation: Hardware-timestamp sequencer with 3μs window sorts all messages by true arrival time.

Matching Engine Crash

All order books lost (in-memory only). Mitigation: Hot standby failover in 5-10 ms. Cold recovery from snapshot + WAL replay in ~6 seconds.

GC Pause (if Java)

10-50 ms pause stalls matching for 10,000-50,000 orders. Mitigation: Zero allocation in hot path, pre-allocated pools, primitive types, Disruptor pattern. LMAX achieves 0 GC during market hours in Java.

NVMe Write Stall

SSD garbage collection causes ~100 μs write latency spike. Mitigation: WAL writer is a Disruptor consumer on a separate core. Matching engine is decoupled from disk I/O. Disruptor absorbs ~262 ms of lag.

Anti-patterns

🚫
Match orders across threads for parallelism

Breaks price-time priority; non-deterministic fills.

✓ Better: Single-threaded matching per symbol; state journaled; deterministic replay.
🚫
Lock rows in Postgres for each fill

Nowhere near the required throughput (10-100k trades/sec/symbol).

✓ Better: In-memory order book; fills written via journal, settled async to DB.
🚫
Use message queue (Kafka) for order routing

Ordering guarantee isn't strict enough; spikes add latency.

✓ Better: Low-latency sequencer with durable journal; Kafka for post-trade analytics only.
09

Interview Tips

💡
Lead with the three-plane split. "This system has three fundamentally different planes — a distributed gateway, a single-threaded matching engine, and a multicast market data pipeline. Each has a different scaling model." This immediately signals deep architectural understanding.
Explain WHY single-threaded. "Single-threaded eliminates all synchronization overhead — no locks, no CAS, no cache coherence traffic. For sequential workloads like order matching, this is FASTER than multi-threaded, not slower." Counter-intuitive but correct.
🎯
Name the Disruptor. "I'd use the LMAX Disruptor pattern — a pre-allocated ring buffer with independent consumer cursors and cache-line-padded sequences." Naming a specific, well-known pattern shows you've studied real exchange architectures.
📋
Mention the OTR ratio. "Order-to-trade ratio is ~150:1 — the book is overwhelmingly an add-and-cancel machine, not a matching machine. Cancel performance matters as much as match performance." Shows you understand real market microstructure.
🔑
Talk about fairness as a constraint. "Cable length equalization, hardware timestamps with PTP, and a sequencer with a 3μs window — fairness isn't just nice-to-have, it's legally mandated by the SEC." This separates you from generic high-performance system answers.
🏗️
Know why kernel bypass matters. "The Linux kernel adds ~10μs per packet. Our budget is 10μs total. Kernel bypass via DPDK/Onload reduces network I/O to ~0.5μs, making microsecond matching physically possible." Concrete numbers beat hand-waving.
11

Evolution

How this design grows from a simple matching engine to a multi-asset global exchange.

1

MVP — Single-Symbol Matching Engine

One thread, one order book, one symbol. HashMap + linked list. Standard TCP networking. File-based WAL. Process-level crash recovery. Proves the matching algorithm works correctly.

2

Multi-Symbol with Partitioning

Symbol-based partitioning across CPU cores. Pre-allocated memory pools. Flat array price levels. Custom open-addressing HashMap. Kernel bypass on internal paths (DPDK). Basic UDP multicast for market data.

3

Production Exchange (Current Design)

Full Disruptor fan-out, RDMA replication with hot standby, hardware-timestamp sequencer for fairness, A/B multicast feeds with MoldUDP64/ITCH, PTP time synchronization, auction mechanisms, LULD circuit breakers, co-location facility with cable equalization.

4

Multi-Asset Global Exchange

Options (pro-rata matching, 3M order books), futures (24-hour, margin model), cross-asset combo orders, multi-data-center with active-active gateway routing, FPGA-accelerated matching for sub-microsecond latency, ML-based surveillance system.

Next up