Distributed Priority Queue

01

Problem Statement

Design a distributed priority queue system that accepts messages tagged with priority levels, stores them durably, and delivers them to consumers in strict or near-strict priority order — even under high throughput and node failures. The system must support delayed/scheduled messages, at-least-once delivery with visibility timeouts, and dead-letter queues for poison messages.

Think of a hospital emergency room triage system. Patients arrive constantly, each tagged with an urgency level. The system must always serve the most critical patient next — across multiple hospitals (nodes). A sprained ankle should never be treated before a heart attack, no matter which hospital either patient walked into.

Core question: How do you enforce priority ordering across distributed partitions without creating a single serialization bottleneck that kills throughput?

02

Requirements

Functional Requirements

Enqueue with priority — Producers submit messages with a priority level (0–9, higher = more urgent) and optional payload up to 256 KB
Dequeue by priority — Consumers receive the globally highest-priority message available, with FIFO ordering within the same priority level
Delayed / scheduled messages — Enqueue now, but the message only becomes eligible for delivery at a specified future timestamp
At-least-once delivery — Messages are made invisible on dequeue (visibility timeout). If not acknowledged, they reappear for retry
Dead-letter queue (DLQ) — Messages that fail processing after N attempts are moved to a separate queue for inspection
Multi-tenant isolation — Multiple independent queues with per-tenant quotas and rate limits

Non-Functional Requirements

Throughput: 100K+ enqueues/sec sustained, with 3x peak headroom
Dequeue latency: p99 < 50ms for highest-priority items
Durability: No message loss after enqueue is acknowledged — survive single-node failures
Ordering: Near-strict global priority ordering (< 5ms ordering gap acceptable across partitions)
Scalability: Horizontal — adding nodes increases capacity linearly
Availability: Favour availability over strict consistency (AP in CAP terms)

03

Scale Estimation

Modelling a priority-aware queue used for task scheduling, ad auction ranking, alert triage, or order fulfillment across a large platform — 500M messages/day with 2 KB average message size.

~5,800/s

Avg enqueue rate

~17,400/s

Peak enqueue (3×)

~1 TB/day

Raw storage

~7 TB

7-day retention

~30K/s

Peak dequeue rate

10

Priority levels (0–9)

Derivation

500M messages ÷ 86,400 seconds = ~5,800 enqueues/sec. Peak is 3× average = 17,400/sec. Storage: 500M × 2 KB = 1 TB/day, 7 TB for 7-day retention. Priority distribution is skewed: ~70% land in priorities 0–3, ~20% in 4–6, and only ~10% in 7–9. This skew is critical — it means the "hot" partition handling high-priority messages sees only 10% of total traffic, making it feasible to run as a small, dedicated replica set. Dequeue peak is higher than enqueue peak (30K/s) because consumers burst-poll high-priority buckets aggressively.

The real bottleneck isn't writes — it's dequeue contention. If 50 consumers all ask for "the most urgent item" simultaneously, they're all competing for the same record. Naive designs collapse here.

04

API Design

Enqueue — POST /v1/queues/{queue_id}/messages

{
  "payload":     "{ ... }",              // opaque bytes, up to 256 KB
  "priority":    7,                      // 0–9, higher = more urgent
  "delay_until": "2026-04-09T15:00Z",   // optional: scheduled delivery
  "dedup_key":   "order-12345",          // optional: idempotency key
  "ttl_seconds": 86400,                 // auto-expire if unconsumed
  "metadata":    { "tenant": "acme", "trace_id": "abc-123" }
}

→ 201 Created
{
  "message_id": "msg_8f3a...",
  "enqueued_at": "2026-04-09T12:00:00Z",
  "partition":   3
}

Dequeue — GET /v1/queues/{queue_id}/messages?max=10&wait_seconds=20&min_priority=5

→ 200 OK
{
  "messages": [
    {
      "message_id":         "msg_8f3a...",
      "payload":            "{ ... }",
      "priority":           9,
      "receipt_handle":     "rh_xk29...",   // one-time token for ACK
      "enqueued_at":        "...",
      "attempt":            1,
      "visibility_timeout": 30              // seconds
    }
  ]
}

Acknowledge — DELETE /v1/queues/{queue_id}/messages/{message_id}

Header: X-Receipt-Handle: rh_xk29...
→ 204 No Content

Negative Acknowledge — POST /v1/queues/{queue_id}/messages/{message_id}/nack

{
  "receipt_handle": "rh_xk29...",
  "delay_seconds":  10                  // optional backoff before retry
}
→ 200 OK

Queue Stats — GET /v1/queues/{queue_id}/stats

→ 200 OK
{
  "depth_by_priority": { "9": 12, "8": 45, "7": 230, ... },
  "in_flight":         87,
  "dlq_depth":         3,
  "oldest_message_age_seconds": 142
}

Key API Design Choices

Receipt handle — a one-time token proving you received this message. Prevents double-ACK from a different consumer. Borrowed from SQS.
Long polling (wait_seconds) — consumer blocks server-side until a message arrives or timeout expires. Avoids thundering-herd polling.
min_priority filter — lets you dedicate consumer groups to high-priority work only.
Dedup key — producer-side idempotency. Same key within a window is silently dropped.

05

High-Level Architecture

The architecture splits into three planes: Ingest (stateless API servers), Storage (priority-partitioned sorted sets with WAL durability), and Dispatch (coordinator that merges partition heads to serve globally-ordered dequeues).

Component Roles

API Servers — Stateless. Auth, validation, dedup check, rate limiting. Route to the correct partition tier.
Partition Router — Routes by priority band: 7–9 → Hot, 4–6 → Warm, 0–3 → Cold.
Hot/Warm/Cold Partitions — Each is a self-contained priority queue using Redis ZSET with composite score (priority × 10¹³ + timestamp). Cold tier uses RocksDB for cost efficiency.
Priority Coordinator — Maintains a min-heap of partition heads. Consumers ask the coordinator, not individual partitions.
WAL — Append-only write-ahead log on disk. On crash, replay to rebuild ZSET.
ZooKeeper/etcd — Leader election for coordinator, partition assignments, consumer group state.

Request Flow — Step Through

Producer→API Server→Partition Router→Hot Partition→Coordinator→Consumer→ACK→Delete

Click Next Step to walk through the request flow.

06

Deep Dive — Global Priority Ordering Across Partitions

The moment you partition for throughput, you break global ordering. This is the fundamental tension of distributed priority queues. Here are three approaches, from simple to planetary:

Approach 1: Single Partition

One node, one sorted set

ZPOPMIN gives the highest-priority item instantly. Works beautifully up to ~10K/sec. Beyond that, you're stuck — vertical scaling has a ceiling, and this is your single point of failure. Does not scale.

Approach 2: Priority-Level Sharding ✓

Partition by priority band — our chosen approach

Instead of partitioning by message key, partition by priority level. Priorities 7–9 go to a dedicated "hot" partition. Priorities 0–3 go to "cold". This is elegant because:

• High-priority consumers only talk to the hot partition — no cross-partition coordination needed for the most latency-sensitive traffic
• The hot partition is small (10% of total traffic) — a single node or small replica set handles it easily
• Cold partitions can be sharded further by key, since ordering among low-priority messages is less critical

The tradeoff: A consumer asking for "the single highest-priority message globally" must waterfall: drain hot first, then warm, then cold. This is a simple sequential check, not a complex merge.

Approach 3: Coordinator Merge (Planet-Scale)

Hash-partition everything + coordinator with min-heap

For millions/sec: hash-partition all messages and run a Priority Coordinator that maintains a min-heap of each partition's head element. The coordinator pops the heap, issues ZPOPMIN to the winning partition, and pushes the new head back. The coordinator stores zero messages — only partition heads. Even with 1,000 partitions, the heap has 1,000 entries → O(log 1000) ≈ 10 comparisons. Not a bottleneck.

Composite Score Packing

Inside each partition's ZSET, priority and timestamp are packed into a single 64-bit float score:

score = (MAX_PRIORITY - priority) × 10¹³ + timestamp_microseconds

// Priority 9, ts 1712678400000000:
score = (9-9) × 10¹³ + 1712678400000000 = 1712678400000000

// Priority 7, ts 1712678400000001:
score = (9-7) × 10¹³ + 1712678400000001 = 20001712678400000001

// Lower score = higher priority = dequeued first
// ZPOPMIN always returns the most urgent, oldest message

Consumer Dequeue Sequence

sequenceDiagram participant C as Consumer participant CO as Coordinator participant P2 as Partition 2 (Hot) participant P0 as Partition 0 (Cold) C->>CO: GET /dequeue (long-poll) CO->>CO: Check heap top → (pri=9, part=2) CO->>P2: ZPOPMIN P2-->>CO: msg_8f3a (pri=9) P2-->>CO: New head: (pri=7, ts=1010) CO->>CO: Push (pri=7, part=2) into heap CO-->>C: msg_8f3a + receipt_handle Note over C: Visibility timeout starts (30s) C->>CO: ACK msg_8f3a CO->>P2: Delete msg_8f3a permanently

The Ordering Gap

Between the coordinator popping the heap and the new head being reported, a higher-priority message could arrive on that partition unseen. This window is typically < 5ms. The ordering is near-strict, not perfectly strict. For task scheduling, alert triage, and ad ranking, this is perfectly acceptable.

Coordinator failure recovery: The coordinator is stateless — it can be rebuilt by querying each partition's head. On failover via ZooKeeper leader election, the new coordinator scans all partition heads (~50ms with 1,000 partitions) and resumes.

07

Route all dequeues through a single coordinator with pessimistic locking. Caps throughput at ~50K/sec — the coordinator becomes the ceiling. Acceptable for small-scale only.

08

What Can Go Wrong

Partition Leader Crash

The hot partition holding priority 7–9 messages goes down. Each partition runs as a replica set (1 leader + 2 followers). On leader failure, a follower is promoted in 2–5 seconds. During failover, high-priority messages are briefly unavailable — consumers receive lower-priority messages instead. No data loss because the WAL is replicated.

Priority Inversion Under Load

Consumers are busy processing priority-2 items they already dequeued when a burst of priority-9 messages arrives. Mitigation: Dedicated consumer groups per priority tier. High-priority consumers never waste cycles on low-priority work. Keep visibility timeouts short (15–30s) so low-priority messages return to the queue quickly if consumers need to be reassigned.

Coordinator Heap Staleness

The coordinator's heap says partition 3 has a priority-7 head, but a priority-9 message just arrived there. A consumer gets a suboptimal message. Mitigation: Partitions push head-change notifications via lightweight pub/sub. Periodic full scans every 5 seconds as a consistency backstop.

Thundering Herd on Dequeue

1,000 consumers long-polling simultaneously. A single high-priority message arrives — all 1,000 connections wake up and race for it. 999 get nothing. Mitigation: The coordinator assigns messages to specific consumers — it doesn't broadcast availability. Only one consumer's long-poll is resolved per message. Thundering herd → orderly queue.

Visibility Timeout Duplication

Consumer A dequeues a message, processes it in 35s (timeout was 30s). Message reappears, consumer B picks it up → processed twice. Mitigation: Consumers extend visibility timeout before expiry (heartbeat pattern). The receipt handle expires on timeout — consumer A's late ACK is rejected. Consumers must always be idempotent.

Hot Partition Overwhelmed

A sudden flood of priority-9 messages exceeds the hot partition's capacity. Enqueues start failing. Mitigation: Auto-split hot partitions when depth exceeds a threshold. The hot tier can be a small cluster (3 nodes) with sub-sharding by hash, rather than a single node.

09

Interview Tips

💡

Start with a single-node priority queue.
Draw a min-heap or sorted set, show enqueue/dequeue, and only then say: "This works for one machine — now let's distribute it." Build from simple to complex. Never jump to Kafka + ZooKeeper on slide one.

⚡

Name the core tension immediately.
Say it explicitly: "The fundamental challenge is that partitioning gives us throughput but breaks global ordering. Every design decision I make is navigating this tradeoff." This frames the entire conversation and shows architectural maturity.

🎯

Don't confuse message queues with priority queues.
Kafka is a log — partition-ordered, offset-based consumption. A priority queue needs sorted-set semantics. If asked "why not Kafka?", answer: "Kafka gives FIFO per partition. We need cross-partition priority ordering — a fundamentally different access pattern."

🔑

Visibility timeout is your delivery-guarantee anchor.
If asked about exactly-once: "We provide at-least-once via visibility timeout. Exactly-once requires the consumer's side-effect to be idempotent — that's a consumer-side concern, not a queue concern. SQS and Pub/Sub work this way."

📈

Have a three-stage scaling story.
"At 10K/sec → single node with ZSET. At 100K/sec → priority-level sharding with dedicated tiers. At 1M/sec → hash partitioning with coordinator merge." Each stage justified by the numbers.

10

Evolution

How this design grows from MVP to planet-scale.

1

Single Node MVP

One Redis instance with a ZSET per queue. Composite score packing. Visibility timeout via a separate "in-flight" set with TTL. Handles up to 10K/sec. Durable via RDB snapshots + AOF. One consumer group. Good enough for a startup's task scheduler.

2

Priority-Tiered Partitioning

Split into hot (7–9), warm (4–6), cold (0–3) tiers. Each tier is a replica set. Dedicated consumer groups per tier. Add dead-letter queues, metrics dashboards, per-tenant quotas. WAL-based replication for tighter durability. Handles 100K/sec.

3

Coordinator-Based Global Ordering

Hash-partition within each priority tier for horizontal scale. Priority coordinator with partition-head min-heap. Push-based head-change notifications. Auto-splitting hot partitions. Handles 1M+/sec.

4

Planet-Scale Platform

Multi-region with regional coordinators and async cross-region replication. Multi-tenant SaaS with noisy-neighbour isolation. Adaptive priority rebalancing — if priority-9 floods, cap its partition share. Serverless consumer auto-scaling per priority depth. Full observability: per-priority latency histograms, consumer lag alerts, partition rebalance automation.

📺

References & Videos

System Design: Message Queues

ByteByteGo · 10 min

Kafka Architecture Deep Dive

Hussein Nasser · 25 min

Apache Kafka Documentation

Apache Kafka

Design a Message Queue

AlgoMaster

Problem Statement

Requirements

Functional Requirements

Non-Functional Requirements

Scale Estimation

Derivation

API Design

Key API Design Choices

High-Level Architecture

Component Roles

Deep Dive — Global Priority Ordering Across Partitions

Approach 1: Single Partition

One node, one sorted set

Approach 2: Priority-Level Sharding ✓

Partition by priority band — our chosen approach

Approach 3: Coordinator Merge (Planet-Scale)

Hash-partition everything + coordinator with min-heap

Composite Score Packing

Consumer Dequeue Sequence

The Ordering Gap

Key Design Decisions & Tradeoffs

1. Priority-Level Sharding vs Hash Sharding

Priority-Level Sharding

Hash Sharding

2. Pull (Long-Poll) vs Push (Streaming)

Long-Poll (Pull)

Push via WebSocket / gRPC Stream

3. At-Least-Once vs Exactly-Once

At-Least-Once + Visibility Timeout

Exactly-Once

4. Redis ZSET vs Heap-on-Disk (RocksDB)

Redis ZSET (Hot) + RocksDB (Cold)

All-Redis or All-Disk

5. Strict vs Near-Strict Global Ordering

Near-Strict (< 5ms gap)

Strict Global Ordering

What Can Go Wrong

Partition Leader Crash

Priority Inversion Under Load

Coordinator Heap Staleness

Thundering Herd on Dequeue

Visibility Timeout Duplication

Hot Partition Overwhelmed

Interview Tips

Similar Problems

Kafka / Message Queue

Distributed Task Scheduler

Rate Limiter

Notification System

Evolution

Single Node MVP

Priority-Tiered Partitioning

Coordinator-Based Global Ordering

Planet-Scale Platform

References & Videos

Kafka / Message Queue

Distributed Task Scheduler

Message Queue vs Pub/Sub