System Design — 015

Distributed Priority Queue

Design a horizontally scalable queue that always delivers the highest-priority message first — like a planetary-scale ER triage system where the most critical patients are always treated next, across every hospital in the network.

distributed-systemspriority-orderingat-least-oncesorted-setspartitioning
01

Problem Statement

Design a distributed priority queue system that accepts messages tagged with priority levels, stores them durably, and delivers them to consumers in strict or near-strict priority order — even under high throughput and node failures. The system must support delayed/scheduled messages, at-least-once delivery with visibility timeouts, and dead-letter queues for poison messages.

Think of a hospital emergency room triage system. Patients arrive constantly, each tagged with an urgency level. The system must always serve the most critical patient next — across multiple hospitals (nodes). A sprained ankle should never be treated before a heart attack, no matter which hospital either patient walked into.

Core question: How do you enforce priority ordering across distributed partitions without creating a single serialization bottleneck that kills throughput?

02

Requirements

Functional Requirements

  • Enqueue with priority — Producers submit messages with a priority level (0–9, higher = more urgent) and optional payload up to 256 KB
  • Dequeue by priority — Consumers receive the globally highest-priority message available, with FIFO ordering within the same priority level
  • Delayed / scheduled messages — Enqueue now, but the message only becomes eligible for delivery at a specified future timestamp
  • At-least-once delivery — Messages are made invisible on dequeue (visibility timeout). If not acknowledged, they reappear for retry
  • Dead-letter queue (DLQ) — Messages that fail processing after N attempts are moved to a separate queue for inspection
  • Multi-tenant isolation — Multiple independent queues with per-tenant quotas and rate limits

Non-Functional Requirements

  • Throughput: 100K+ enqueues/sec sustained, with 3x peak headroom
  • Dequeue latency: p99 < 50ms for highest-priority items
  • Durability: No message loss after enqueue is acknowledged — survive single-node failures
  • Ordering: Near-strict global priority ordering (< 5ms ordering gap acceptable across partitions)
  • Scalability: Horizontal — adding nodes increases capacity linearly
  • Availability: Favour availability over strict consistency (AP in CAP terms)
03

Scale Estimation

Modelling a priority-aware queue used for task scheduling, ad auction ranking, alert triage, or order fulfillment across a large platform — 500M messages/day with 2 KB average message size.

~5,800/s
Avg enqueue rate
~17,400/s
Peak enqueue (3×)
~1 TB/day
Raw storage
~7 TB
7-day retention
~30K/s
Peak dequeue rate
10
Priority levels (0–9)

Derivation

500M messages ÷ 86,400 seconds = ~5,800 enqueues/sec. Peak is 3× average = 17,400/sec. Storage: 500M × 2 KB = 1 TB/day, 7 TB for 7-day retention. Priority distribution is skewed: ~70% land in priorities 0–3, ~20% in 4–6, and only ~10% in 7–9. This skew is critical — it means the "hot" partition handling high-priority messages sees only 10% of total traffic, making it feasible to run as a small, dedicated replica set. Dequeue peak is higher than enqueue peak (30K/s) because consumers burst-poll high-priority buckets aggressively.

The real bottleneck isn't writes — it's dequeue contention. If 50 consumers all ask for "the most urgent item" simultaneously, they're all competing for the same record. Naive designs collapse here.

04

API Design

Enqueue — POST /v1/queues/{queue_id}/messages
{
  "payload":     "{ ... }",              // opaque bytes, up to 256 KB
  "priority":    7,                      // 0–9, higher = more urgent
  "delay_until": "2026-04-09T15:00Z",   // optional: scheduled delivery
  "dedup_key":   "order-12345",          // optional: idempotency key
  "ttl_seconds": 86400,                 // auto-expire if unconsumed
  "metadata":    { "tenant": "acme", "trace_id": "abc-123" }
}

→ 201 Created
{
  "message_id": "msg_8f3a...",
  "enqueued_at": "2026-04-09T12:00:00Z",
  "partition":   3
}
Dequeue — GET /v1/queues/{queue_id}/messages?max=10&wait_seconds=20&min_priority=5
→ 200 OK
{
  "messages": [
    {
      "message_id":         "msg_8f3a...",
      "payload":            "{ ... }",
      "priority":           9,
      "receipt_handle":     "rh_xk29...",   // one-time token for ACK
      "enqueued_at":        "...",
      "attempt":            1,
      "visibility_timeout": 30              // seconds
    }
  ]
}
Acknowledge — DELETE /v1/queues/{queue_id}/messages/{message_id}
Header: X-Receipt-Handle: rh_xk29...
→ 204 No Content
Negative Acknowledge — POST /v1/queues/{queue_id}/messages/{message_id}/nack
{
  "receipt_handle": "rh_xk29...",
  "delay_seconds":  10                  // optional backoff before retry
}
→ 200 OK
Queue Stats — GET /v1/queues/{queue_id}/stats
→ 200 OK
{
  "depth_by_priority": { "9": 12, "8": 45, "7": 230, ... },
  "in_flight":         87,
  "dlq_depth":         3,
  "oldest_message_age_seconds": 142
}

Key API Design Choices

Receipt handle — a one-time token proving you received this message. Prevents double-ACK from a different consumer. Borrowed from SQS.
Long polling (wait_seconds) — consumer blocks server-side until a message arrives or timeout expires. Avoids thundering-herd polling.
min_priority filter — lets you dedicate consumer groups to high-priority work only.
Dedup key — producer-side idempotency. Same key within a window is silently dropped.

05

High-Level Architecture

The architecture splits into three planes: Ingest (stateless API servers), Storage (priority-partitioned sorted sets with WAL durability), and Dispatch (coordinator that merges partition heads to serve globally-ordered dequeues).

Producers Services / Apps Load Balancer Nginx / ALB API Servers Stateless Partition Router By priority tier Hot Partition Pri 7–9 · Redis ZSET Warm Partition Pri 4–6 · Redis ZSET Cold Partition Pri 0–3 · RocksDB Coordinator Min-heap of heads Consumers Long-poll / Pull ZooKeeper Leader election WAL Disk HTTPS REST Route 10% 20% 70% Dequeue Flush

Component Roles

API Servers — Stateless. Auth, validation, dedup check, rate limiting. Route to the correct partition tier.
Partition Router — Routes by priority band: 7–9 → Hot, 4–6 → Warm, 0–3 → Cold.
Hot/Warm/Cold Partitions — Each is a self-contained priority queue using Redis ZSET with composite score (priority × 10¹³ + timestamp). Cold tier uses RocksDB for cost efficiency.
Priority Coordinator — Maintains a min-heap of partition heads. Consumers ask the coordinator, not individual partitions.
WAL — Append-only write-ahead log on disk. On crash, replay to rebuild ZSET.
ZooKeeper/etcd — Leader election for coordinator, partition assignments, consumer group state.

Request Flow — Step Through
ProducerAPI ServerPartition RouterHot PartitionCoordinatorConsumerACKDelete
Click Next Step to walk through the request flow.
06

Deep Dive — Global Priority Ordering Across Partitions

The moment you partition for throughput, you break global ordering. This is the fundamental tension of distributed priority queues. Here are three approaches, from simple to planetary:

Approach 1: Single Partition

One node, one sorted set

ZPOPMIN gives the highest-priority item instantly. Works beautifully up to ~10K/sec. Beyond that, you're stuck — vertical scaling has a ceiling, and this is your single point of failure. Does not scale.

Approach 2: Priority-Level Sharding ✓

Partition by priority band — our chosen approach

Instead of partitioning by message key, partition by priority level. Priorities 7–9 go to a dedicated "hot" partition. Priorities 0–3 go to "cold". This is elegant because:

• High-priority consumers only talk to the hot partition — no cross-partition coordination needed for the most latency-sensitive traffic
• The hot partition is small (10% of total traffic) — a single node or small replica set handles it easily
• Cold partitions can be sharded further by key, since ordering among low-priority messages is less critical

The tradeoff: A consumer asking for "the single highest-priority message globally" must waterfall: drain hot first, then warm, then cold. This is a simple sequential check, not a complex merge.

Approach 3: Coordinator Merge (Planet-Scale)

Hash-partition everything + coordinator with min-heap

For millions/sec: hash-partition all messages and run a Priority Coordinator that maintains a min-heap of each partition's head element. The coordinator pops the heap, issues ZPOPMIN to the winning partition, and pushes the new head back. The coordinator stores zero messages — only partition heads. Even with 1,000 partitions, the heap has 1,000 entries → O(log 1000) ≈ 10 comparisons. Not a bottleneck.

Composite Score Packing

Inside each partition's ZSET, priority and timestamp are packed into a single 64-bit float score:

score = (MAX_PRIORITY - priority) × 10¹³ + timestamp_microseconds

// Priority 9, ts 1712678400000000:
score = (9-9) × 10¹³ + 1712678400000000 = 1712678400000000

// Priority 7, ts 1712678400000001:
score = (9-7) × 10¹³ + 1712678400000001 = 20001712678400000001

// Lower score = higher priority = dequeued first
// ZPOPMIN always returns the most urgent, oldest message

Consumer Dequeue Sequence

sequenceDiagram participant C as Consumer participant CO as Coordinator participant P2 as Partition 2 (Hot) participant P0 as Partition 0 (Cold) C->>CO: GET /dequeue (long-poll) CO->>CO: Check heap top → (pri=9, part=2) CO->>P2: ZPOPMIN P2-->>CO: msg_8f3a (pri=9) P2-->>CO: New head: (pri=7, ts=1010) CO->>CO: Push (pri=7, part=2) into heap CO-->>C: msg_8f3a + receipt_handle Note over C: Visibility timeout starts (30s) C->>CO: ACK msg_8f3a CO->>P2: Delete msg_8f3a permanently

The Ordering Gap

Between the coordinator popping the heap and the new head being reported, a higher-priority message could arrive on that partition unseen. This window is typically < 5ms. The ordering is near-strict, not perfectly strict. For task scheduling, alert triage, and ad ranking, this is perfectly acceptable.

Coordinator failure recovery: The coordinator is stateless — it can be rebuilt by querying each partition's head. On failover via ZooKeeper leader election, the new coordinator scans all partition heads (~50ms with 1,000 partitions) and resumes.

07

Key Design Decisions & Tradeoffs

1. Priority-Level Sharding vs Hash Sharding

✓ Chosen

Priority-Level Sharding

Partition by priority band (hot/warm/cold). High-priority consumers get a dedicated low-contention shard with sub-10ms latency. No coordinator needed for the critical path.

✗ Alternative

Hash Sharding

Uniform load distribution across N nodes. Requires a coordinator to merge priority ordering across all partitions. Adds latency + bottleneck. Better at extreme scale (1M+/sec) when even the hot shard is overwhelmed.

2. Pull (Long-Poll) vs Push (Streaming)

✓ Chosen

Long-Poll (Pull)

Consumer sends GET with wait_seconds=20. Server holds the connection. Consumer controls its own throughput. Natural backpressure — consumers only ask when ready.

✗ Alternative

Push via WebSocket / gRPC Stream

Lower latency, but the server must track consumer capacity and handle backpressure. If a consumer is slow, messages pile up in-flight. More complex operationally.

3. At-Least-Once vs Exactly-Once

✓ Chosen

At-Least-Once + Visibility Timeout

Message becomes invisible for 30s on dequeue. If consumer doesn't ACK, it reappears. Duplicates are possible — consumers must be idempotent (dedup by message_id).

✗ Alternative

Exactly-Once

Requires distributed transactions between queue and consumer's side-effect store. Massive complexity and latency increase. In practice, idempotent consumers + at-least-once is sufficient. SQS, Pub/Sub, and most production queues work this way.

4. Redis ZSET vs Heap-on-Disk (RocksDB)

✓ Chosen

Redis ZSET (Hot) + RocksDB (Cold)

In-memory sorted set gives O(log N) enqueue and ZPOPMIN. WAL on disk ensures durability. RocksDB for cold tiers trades latency for cost — disk reads are acceptable for low-priority messages.

✗ Alternative

All-Redis or All-Disk

All-Redis: expensive at 7 TB retention. All-disk: dequeue on the hot path requires disk I/O, killing p99 latency for urgent messages. The hybrid approach matches cost to priority.

5. Strict vs Near-Strict Global Ordering

✓ Chosen

Near-Strict (< 5ms gap)

Accept a tiny window where ordering might be slightly off across partitions. Avoids a single serialization point. Sufficient for all practical use cases — task scheduling, alerts, ad ranking.

✗ Alternative

Strict Global Ordering

Route all dequeues through a single coordinator with pessimistic locking. Caps throughput at ~50K/sec — the coordinator becomes the ceiling. Acceptable for small-scale only.

08

What Can Go Wrong

Partition Leader Crash

The hot partition holding priority 7–9 messages goes down. Each partition runs as a replica set (1 leader + 2 followers). On leader failure, a follower is promoted in 2–5 seconds. During failover, high-priority messages are briefly unavailable — consumers receive lower-priority messages instead. No data loss because the WAL is replicated.

Priority Inversion Under Load

Consumers are busy processing priority-2 items they already dequeued when a burst of priority-9 messages arrives. Mitigation: Dedicated consumer groups per priority tier. High-priority consumers never waste cycles on low-priority work. Keep visibility timeouts short (15–30s) so low-priority messages return to the queue quickly if consumers need to be reassigned.

Coordinator Heap Staleness

The coordinator's heap says partition 3 has a priority-7 head, but a priority-9 message just arrived there. A consumer gets a suboptimal message. Mitigation: Partitions push head-change notifications via lightweight pub/sub. Periodic full scans every 5 seconds as a consistency backstop.

Thundering Herd on Dequeue

1,000 consumers long-polling simultaneously. A single high-priority message arrives — all 1,000 connections wake up and race for it. 999 get nothing. Mitigation: The coordinator assigns messages to specific consumers — it doesn't broadcast availability. Only one consumer's long-poll is resolved per message. Thundering herd → orderly queue.

Visibility Timeout Duplication

Consumer A dequeues a message, processes it in 35s (timeout was 30s). Message reappears, consumer B picks it up → processed twice. Mitigation: Consumers extend visibility timeout before expiry (heartbeat pattern). The receipt handle expires on timeout — consumer A's late ACK is rejected. Consumers must always be idempotent.

Hot Partition Overwhelmed

A sudden flood of priority-9 messages exceeds the hot partition's capacity. Enqueues start failing. Mitigation: Auto-split hot partitions when depth exceeds a threshold. The hot tier can be a small cluster (3 nodes) with sub-sharding by hash, rather than a single node.

09

Interview Tips

💡
Start with a single-node priority queue.
Draw a min-heap or sorted set, show enqueue/dequeue, and only then say: "This works for one machine — now let's distribute it." Build from simple to complex. Never jump to Kafka + ZooKeeper on slide one.
Name the core tension immediately.
Say it explicitly: "The fundamental challenge is that partitioning gives us throughput but breaks global ordering. Every design decision I make is navigating this tradeoff." This frames the entire conversation and shows architectural maturity.
🎯
Don't confuse message queues with priority queues.
Kafka is a log — partition-ordered, offset-based consumption. A priority queue needs sorted-set semantics. If asked "why not Kafka?", answer: "Kafka gives FIFO per partition. We need cross-partition priority ordering — a fundamentally different access pattern."
🔑
Visibility timeout is your delivery-guarantee anchor.
If asked about exactly-once: "We provide at-least-once via visibility timeout. Exactly-once requires the consumer's side-effect to be idempotent — that's a consumer-side concern, not a queue concern. SQS and Pub/Sub work this way."
📈
Have a three-stage scaling story.
"At 10K/sec → single node with ZSET. At 100K/sec → priority-level sharding with dedicated tiers. At 1M/sec → hash partitioning with coordinator merge." Each stage justified by the numbers.
11

Evolution

How this design grows from MVP to planet-scale.

1

Single Node MVP

One Redis instance with a ZSET per queue. Composite score packing. Visibility timeout via a separate "in-flight" set with TTL. Handles up to 10K/sec. Durable via RDB snapshots + AOF. One consumer group. Good enough for a startup's task scheduler.

2

Priority-Tiered Partitioning

Split into hot (7–9), warm (4–6), cold (0–3) tiers. Each tier is a replica set. Dedicated consumer groups per tier. Add dead-letter queues, metrics dashboards, per-tenant quotas. WAL-based replication for tighter durability. Handles 100K/sec.

3

Coordinator-Based Global Ordering

Hash-partition within each priority tier for horizontal scale. Priority coordinator with partition-head min-heap. Push-based head-change notifications. Auto-splitting hot partitions. Handles 1M+/sec.

4

Planet-Scale Platform

Multi-region with regional coordinators and async cross-region replication. Multi-tenant SaaS with noisy-neighbour isolation. Adaptive priority rebalancing — if priority-9 floods, cap its partition share. Serverless consumer auto-scaling per priority depth. Full observability: per-priority latency histograms, consumer lag alerts, partition rebalance automation.

Next up