06
Deep Dives — 15 Rabbit Holes
Each rabbit hole explores a critical subsystem at implementation depth — from cache-line-level data structure engineering to regulatory fairness requirements.
The single most performance-critical data structure. Uses a hybrid approach: flat array indexed by price for the hot zone (O(1) lookup, 512KB fits in L2 cache) plus Red-Black Tree for far-from-market orders.
Operation Frequency Drives Optimization
| Operation | Frequency | Target | Complexity |
| Cancel order | 52% of messages | < 0.5 μs | O(1) HashMap + O(1) list remove |
| Add order | 40% | < 1 μs | O(1) array index + O(1) list append |
| Match (trade) | 0.03% | < 3 μs | O(k) fills at best price |
| Read best bid/ask | every operation | < 0.1 μs | O(1) cached pointer |
64-Byte Order Struct (Exactly 1 Cache Line)
struct Order { // hot fields first for prefetcher
int64_t price; // 8B — compared during matching
uint32_t remaining_qty; // 4B — decremented during fills
uint32_t broker_id; // 4B — checked for STP
Order* next; // 8B — linked list traversal
Order* prev; // 8B — linked list removal
uint64_t order_id; // 8B — for execution report
uint64_t cl_ord_id; // 8B — for broker response
uint32_t original_qty; // 4B — cold
uint32_t symbol_id; // 4B — cold
uint64_t timestamp; // 8B — cold
} __attribute__((packed, aligned(64))); // = 64 bytes ✓
Custom Open-Addressing HashMap for O(1) Cancel
1M slots × 16 bytes = 16 MB (fits in L3). Fibonacci hashing with linear probing — at 50% load factor, average 1.5 probes per lookup. 5-10 ns per cancel lookup vs 50-200 ns for std::unordered_map. Pre-allocated pool with free-list allocator on huge pages eliminates all malloc/free in the hot path.
Benchmark: Add: ~22 ns, Cancel: ~30 ns, Match: ~140 ns per order. One core handles 33-45M operations/sec. Our requirement is 100K/sec — 300× headroom.
The core product. Incoming marketable orders sweep through the opposite side of the book, matching against resting orders in FIFO order at each price level.
sequenceDiagram
participant B as Broker
participant G as Gateway
participant M as Matching Engine
participant D as Disruptor
B->>G: FIX NewOrderSingle (BUY 100 AAPL LIMIT $187.46)
G->>M: 64-byte InternalOrder (validated, normalized)
Note over M: Check: $187.46 >= best ask $187.46 → MARKETABLE!
Note over M: Match against resting sell at $187.46 (FIFO)
Note over M: STP check: different brokers ✓
Note over M: Fill 100 shares @ $187.46
M->>D: TRADE event + ORDER_EXECUTED + QUOTE_UPDATE
D-->>G: ExecutionReport (FILLED, liquidity='R' taker)
D-->>G: ExecutionReport to seller (liquidity='A' maker)
Order Type Handling
Market Orders
Always match immediately. No price limit — take whatever is available. Sweep through price levels until fully filled or book exhausted. Never rest on the book.
Limit Orders
Match if marketable (buy price ≥ best ask). Remaining quantity rests on book. IOC: cancel unfilled portion. FOK: pre-check full quantity available before executing any fills.
Stop Orders
Sit in separate trigger table. After every trade, check if trade price crosses any stop triggers. Triggered stops convert to market/limit and enter normal matching — can cause cascades.
Cancel-Replace
Atomic cancel + new order. Price change loses time priority. Qty-down-only keeps priority. New price might be marketable — modify can trigger immediate matching.
Deterministic Replay
Same input sequence must produce same output sequence. Every time. No floating-point (integer prices), no system clock reads (timestamps from input), no randomness, no threads. Any trading day can be replayed and verified — regulatory requirement.
Pre-allocated ring buffer connecting the single-threaded matching engine to 5 downstream consumers. No copying, no locks, no allocation. All consumers read from the same memory with independent cursors.
Ring Buffer: 262,144 slots × 128 bytes = 32 MB (pre-allocated, huge pages)
Producer: Matching engine (single writer, ~1M events/sec)
Consumers (each on own core):
Tier 1: WAL Writer + Replication Sender (no dependencies)
Tier 2: Gateway Response + Market Data (gated behind Tier 1)
Tier 3: Surveillance (gated behind Tier 2)
Cache-Line Padding Eliminates False Sharing
Each cursor is padded to 64 bytes (one cache line). Without padding: producer update invalidates ALL consumer cache lines (~40-70 ns per event). With padding: zero cross-core cache coherence traffic. This single optimization made the Disruptor "impossibly fast."
Dependency Barriers
Gateway cannot send broker ExecutionReport until BOTH WAL and Replication confirm the event is durable. This is the mechanism that achieves zero data loss for acknowledged events without adding latency to the matching engine.
Performance: ~50M messages/sec between threads. 5× faster than any lock-based queue. One event reaches all 5 consumers in ~130 ns total (vs ~500+ ns with standard queues).
The Linux kernel TCP/IP stack adds ~10 μs per packet (interrupts, memory allocation, TCP processing, context switches, data copying). Our entire budget is 10 μs. Kernel networking alone consumes 92% of our budget.
Three Levels of Bypass
External: Solarflare Onload
Intercepts standard socket calls, implements TCP in userspace. Application code unchanged. recv() drops from 9 μs to 1.5 μs. Used for broker FIX connections on gateway servers.
Internal: DPDK / ef_vi (raw UDP)
NIC mapped directly into userspace. Packets polled from DMA ring buffer — no interrupts, no kernel. Receive latency: 0.5 μs. Used for gateway → matching engine internal path.
Replication: RDMA
One-sided writes directly into standby's memory without involving standby CPU. Latency: 1.5 μs same-rack. Used for matching engine → standby replication.
Hardware Timestamping (PTP)
NIC PHY stamps every packet at the wire level with ±5 ns accuracy (GPS-synced PTP clock). This provides regulatory-quality nanosecond timestamps and enables the fairness sequencer.
No database — all state is in memory. Persistence via memory-mapped WAL: append 144-byte entries at ~0.1 μs (memcpy to page cache). OS flushes to NVMe asynchronously. True durability comes from RDMA replication to standby, not local disk.
Recovery Strategy
Cold recovery:
1. Load latest snapshot (~310 MB, ~0.1 sec from NVMe)
2. Replay WAL from snapshot timestamp (~30 sec of events)
3. Replay speed: ~2M events/sec → ~6 seconds total
4. Resume accepting orders
Hot failover (preferred):
Standby takes over in ~5-10 ms
Primary's WAL used only for convenience on restart
Deterministic Replay
Timer events (market open/close, LULD recompute) injected as input messages through the WAL — not from system clock. This ensures replay produces identical order book state. Verified nightly by replaying the full day and comparing every output event.
Pipelined RDMA replication: matching engine doesn't wait (0 μs cost), but gateway consumer IS gated behind replication (+1.5 μs to broker notification). Zero data loss for acknowledged events.
Failover Sequence (~5-10 ms total)
T=0ms Primary crashes
T=3ms Standby detects 3 missed heartbeats (1ms interval)
T=3.5ms Standby fences primary via IPMI power-off
T=3.6ms Standby replays pending events from RDMA buffer
T=4ms Standby promotes to primary (new epoch)
T=5ms Gateways notified, redirect order flow
T=6ms Normal operation resumes
Split-Brain Prevention
Three mechanisms: physical fencing (IPMI power-off of old primary), epoch numbers (gateways reject old-epoch messages), 3-node quorum (primary needs 2/3 votes to continue). If fencing fails → failover aborted rather than risking two primaries.
Unicast to 1,000 subscribers at 1M msg/sec = 1 billion sends/sec = impossible. UDP multicast: send once, switch replicates to all. 50 MB/sec outbound vs 50 GB/sec for unicast.
Channel Architecture
~50 channels, ~100-200 symbols each. Hot symbols (AAPL, SPY) get smaller groups. Each channel has A/B feed on independent switch paths. MoldUDP64 framing with per-channel monotonic sequence numbers for gap detection.
A/B Feed Redundancy
Two completely independent network paths (different NICs, switches, fiber routes). A alone delivers 99.999% of messages. A OR B delivers 99.9999999% (nine 9s). TCP retransmit server handles the ultra-rare case where both feeds lose the same packet.
Adaptive Batching
Trades: flush immediately (0 delay). Quote updates: flush within 10 μs. Depth updates: batch up to 50 μs or 35 messages. Reduces packet rate by 97% while keeping latency-sensitive messages fast.
The defining constraint separating an exchange from any other high-performance system. SEC mandates equal access — latency differences translate directly into money.
Physical: Cable Equalization
Every co-lo path measured by OTDR, padded to equal length (±0.5m = ±2.5 ns). Verified annually by third party.
Temporal: PTP Timestamps
NIC PHY timestamps at ±5 ns accuracy. Two packets 50 ns apart get correctly ordered (30 ns error < 50 ns difference).
Architectural: HW Sequencer
3 μs window collects messages from all gateways, sorts by hardware timestamp. Absorbs all gateway processing variance.
Computational: Determinism
Single-threaded, no float, no clock reads, no randomness. Same input → same output. Regulators can replay any day.
IEX Speed Bump — Alternative Model
350 μs of coiled fiber on all paths. Makes microsecond-level differences irrelevant. 150 ns cable advantage becomes 0.04% of total delay. Our exchange uses the traditional model of minimizing latency for everyone equally.
At market open and close, a completely different algorithm finds a single equilibrium price maximizing matched volume. The closing auction price is the most important price in finance — used by every index fund, ETF, mutual fund, and option for valuation.
Algorithm: Cumulative Supply/Demand Curves
For each possible price P:
buy_demand(P) = MOO buys + all limit buys at P or higher
sell_supply(P) = MOC sells + all limit sells at P or lower
matched(P) = min(buy_demand, sell_supply)
Clearing price = P that maximizes matched(P)
Tie-breakers: minimum imbalance → closest to reference price
All eligible orders execute at the SINGLE clearing price.
NOII (Net Order Imbalance Indicator) published every second
before auction so traders can adjust orders.
NASDAQ's closing cross handles $50-80 billion/day and 10-15% of total US equity volume. Getting the price wrong by $0.01 on AAPL can misvalue ~$168 million in index fund NAV.
Every optimization is rooted in understanding the CPU. L1 access: ~1 ns. L2: ~4 ns. L3: ~14 ns. DRAM: ~80 ns (80× slower than L1). A single DRAM access costs as much time as processing an entire order.
Key Techniques
Cache-Line Alignment
All structs are 64 bytes. No straddling. Predictable access patterns. Hardware prefetcher handles sequential scans.
Huge Pages
2 MB pages eliminate TLB misses. 32 MB order pool = 16 huge pages. All fit in TLB. Regular 4KB pages = 8,192 entries → TLB thrashing.
Core Isolation
isolcpus + nohz_full + IRQ affinity to core 0. Matching cores get zero interrupts, zero timer ticks, zero scheduling.
NUMA Pinning
All order book memory on local NUMA node. Remote access: 140 ns vs 80 ns local = 75% slowdown.
rdtsc for Nanosecond Timestamps
~7 ns to read TSC register (no syscall). clock_gettime: ~20-50 ns (syscall overhead). At 1M timestamps/sec, saves ~30 ms/sec. Used for internal timing; NIC PTP clock for regulatory timestamps.
Stop orders don't sit on the visible book — they sit in a separate trigger table indexed by trigger price (same flat-array optimization). After every trade, check if trade price crosses any triggers.
Most trades don't trigger stops: one integer comparison (~2 ns). Only when a trade crosses a trigger does real work happen — triggered stops convert to market/limit orders and enter normal matching, potentially causing cascading triggers (positive feedback loop). LULD circuit breakers are the safety valve.
Market makers run 20+ strategies simultaneously. STP prevents wash trades (matching against yourself). Checked in the innermost matching loop — one integer comparison per potential match. Modes: cancel resting, cancel incoming, cancel both, cancel smallest. Per-broker configuration, supports sub-broker group-level STP.
Performance impact: branch predictor learns 99%+ of matches are between different brokers → ~0 penalty for the common case.
Every fill tagged as maker (resting, earns ~$0.002/share rebate) or taker (incoming, pays ~$0.003/share fee). Exchange net: $0.001/share × 3B shares/day = $3M/day revenue. Volume-tiered pricing: high-volume brokers get better rates. Fee model drives entire market microstructure — brokers route specifically to earn rebates.
Exchange cannot fail during extreme events (GME squeeze: 10-50× volume, Flash Crash: 20× message rate). Stress testing: replay historical extreme days at 2-10× speed. Capacity headroom: normal <30%, peak <60%, extreme <90%. Graceful degradation: throttle high-OTR brokers first, reject new orders while processing existing ones, widen LULD bands, never crash or lose data.
Core matching engine is generic, parameterized by: matching algorithm (FIFO vs pro-rata), tick size, trading hours, circuit breaker rules, fee schedule. Options: 3M order books (10K stocks × 300 series), pro-rata matching. Futures: 24-hour, margin model. Crypto: 24/7/365, no regulation. Cross-asset combo orders: atomic matching across multiple books — architecturally hardest extension.