06
Deep Dive — Global Priority Ordering Across Partitions
The moment you partition for throughput, you break global ordering. This is the fundamental tension of distributed priority queues. Here are three approaches, from simple to planetary:
Approach 1: Single Partition
One node, one sorted set
ZPOPMIN gives the highest-priority item instantly. Works beautifully up to ~10K/sec. Beyond that, you're stuck — vertical scaling has a ceiling, and this is your single point of failure. Does not scale.
Approach 2: Priority-Level Sharding ✓
Partition by priority band — our chosen approach
Instead of partitioning by message key, partition by priority level. Priorities 7–9 go to a dedicated "hot" partition. Priorities 0–3 go to "cold". This is elegant because:
• High-priority consumers only talk to the hot partition — no cross-partition coordination needed for the most latency-sensitive traffic
• The hot partition is small (10% of total traffic) — a single node or small replica set handles it easily
• Cold partitions can be sharded further by key, since ordering among low-priority messages is less critical
The tradeoff: A consumer asking for "the single highest-priority message globally" must waterfall: drain hot first, then warm, then cold. This is a simple sequential check, not a complex merge.
Approach 3: Coordinator Merge (Planet-Scale)
Hash-partition everything + coordinator with min-heap
For millions/sec: hash-partition all messages and run a Priority Coordinator that maintains a min-heap of each partition's head element. The coordinator pops the heap, issues ZPOPMIN to the winning partition, and pushes the new head back. The coordinator stores zero messages — only partition heads. Even with 1,000 partitions, the heap has 1,000 entries → O(log 1000) ≈ 10 comparisons. Not a bottleneck.
Composite Score Packing
Inside each partition's ZSET, priority and timestamp are packed into a single 64-bit float score:
score = (MAX_PRIORITY - priority) × 10¹³ + timestamp_microseconds
// Priority 9, ts 1712678400000000:
score = (9-9) × 10¹³ + 1712678400000000 = 1712678400000000
// Priority 7, ts 1712678400000001:
score = (9-7) × 10¹³ + 1712678400000001 = 20001712678400000001
// Lower score = higher priority = dequeued first
// ZPOPMIN always returns the most urgent, oldest message
Consumer Dequeue Sequence
sequenceDiagram
participant C as Consumer
participant CO as Coordinator
participant P2 as Partition 2 (Hot)
participant P0 as Partition 0 (Cold)
C->>CO: GET /dequeue (long-poll)
CO->>CO: Check heap top → (pri=9, part=2)
CO->>P2: ZPOPMIN
P2-->>CO: msg_8f3a (pri=9)
P2-->>CO: New head: (pri=7, ts=1010)
CO->>CO: Push (pri=7, part=2) into heap
CO-->>C: msg_8f3a + receipt_handle
Note over C: Visibility timeout starts (30s)
C->>CO: ACK msg_8f3a
CO->>P2: Delete msg_8f3a permanently
The Ordering Gap
Between the coordinator popping the heap and the new head being reported, a higher-priority message could arrive on that partition unseen. This window is typically < 5ms. The ordering is near-strict, not perfectly strict. For task scheduling, alert triage, and ad ranking, this is perfectly acceptable.
Coordinator failure recovery: The coordinator is stateless — it can be rebuilt by querying each partition's head. On failover via ZooKeeper leader election, the new coordinator scans all partition heads (~50ms with 1,000 partitions) and resumes.