Concept · Messaging & Async

Message Queue vs Pub/Sub

01

Why this matters

Two teams both say "we use a queue." One means point-to-point work distribution (one message, one consumer takes it, runs a job). The other means broadcast (one event, N subscribers each process it). These have different correctness properties, different failure modes, different products. Confusing them has wrecked many systems.

In interviews: "would you use SQS or Kafka for this?" is really asking if you understand the difference.

02

The two shapes

Message queue (point-to-point)

One consumer wins each message

Producer enqueues, N consumers compete. Each message is delivered to exactly one consumer. Queue is a work distribution mechanism. SQS, RabbitMQ (classic queues), Redis BLPOP. Used for: job queues, email sending, image processing.

Pub/Sub (broadcast)

Every subscriber gets every message

Publisher emits, all N subscribers receive. Each subscriber processes independently. Pub/Sub is a fan-out mechanism. Kafka (topic + consumer groups), SNS, Redis pub/sub, NATS. Used for: event streaming, cache invalidation, audit logs.

The confusing middle

Kafka is technically pub/sub, but a "consumer group" acts like a queue (each partition consumed by one member of the group). So Kafka does both — which is why it's everywhere. RabbitMQ also does both via exchange types. The difference is per-usage, not per-product.

03

Queue semantics

  • Durability — messages persist (disk) until ACKed. Crashed consumer → message redelivered.
  • Visibility timeout — when a consumer picks up a message, it's invisible to others for N seconds. If the consumer ACKs, gone; if not, re-queued. Prevents duplicate work unless the consumer crashes.
  • At-least-once delivery — on retries, a message may be delivered twice. Consumers must be idempotent.
  • FIFO vs unordered — SQS has both (standard = unordered, higher throughput; FIFO = strict order, lower throughput). RabbitMQ is FIFO within a queue.
  • Dead letter queue — messages that fail N times move to a DLQ for inspection. Critical for production.
Simple in-memory pub/sub with backpressure
import asyncio
from collections import defaultdict

class PubSub:
    def __init__(self, buffer_size=1000):
        self.channels = defaultdict(list)  # topic -> [asyncio.Queue]

    def subscribe(self, topic):
        q = asyncio.Queue(maxsize=1000)
        self.channels[topic].append(q)
        return q

    async def publish(self, topic, msg):
        dead = []
        for q in self.channels[topic]:
            try:
                q.put_nowait(msg)   # drop on backpressure, don't block
            except asyncio.QueueFull:
                dead.append(q)       # slow consumer → disconnect
        for q in dead: self.channels[topic].remove(q)

# Real systems (Redis pub/sub, Kafka) follow this shape but with durability
# + cluster-wide routing. In-memory version is the mental model.
04

Pub/Sub semantics

  • Topics. Producers publish to a topic name. Consumers subscribe by topic.
  • Retention. Kafka retains messages by time (7 days default). Subscribers can replay from any offset. SNS + most pub/sub systems deliver and forget.
  • Consumer groups (Kafka). Members of the same group share a topic's load (partition affinity). Two different groups each get all messages — the fan-out property.
  • Ordering — guaranteed within a partition (Kafka) or within a topic for one subscriber (RabbitMQ topic exchange). Not cross-partition.
05

When to reach for each

Use caseShapeProduct
Background jobs (thumbnails)QueueSQS, RabbitMQ, Sidekiq
Event streaming, audit logPub/SubKafka
Email dispatchQueueSQS, RabbitMQ
Cache invalidation fan-outPub/SubRedis Pub/Sub, NATS
Multiple services react to an orderPub/SubKafka (per-consumer-group semantics)
Low-latency chat deliveryPub/SubNATS, Redis Pub/Sub
Payment processingQueue + idempotency keysSQS FIFO, Kafka compacted
06

Deep dive — why Kafka ate the world

Kafka unified queue and pub/sub by storing messages in a distributed commit log. Every message appended to a partition file, retained by time. Consumers read by offset — they can rewind, replay, or skip.

Four properties this gives you for free:

  • Replay. A new service can consume messages from 7 days ago. Classic queues delete on ACK — no replay possible.
  • Multiple independent consumers. Add a second analytics pipeline; it reads from the same topic without affecting the first.
  • High throughput. Sequential append to disk is fast. A single Kafka broker handles ~500k messages/sec. Partitioning scales horizontally.
  • Exactly-once semantics (with transactional API) — because consumers track offsets, and Kafka supports atomic offset + message writes.

The cost: Kafka is heavy operationally. Zookeeper (or KRaft), brokers, topic planning, consumer lag monitoring. Don't reach for Kafka if SQS fits — SQS is fully managed and you pay per message. Reach for Kafka when you need replay, high throughput, or many independent consumers of the same stream.

07

Real-world

Stripe

SQS for jobs, Kafka for events

Webhook delivery uses SQS (point-to-point retry). Event audit log uses Kafka (long retention, many consumers).

Uber

Kafka for everything streaming

Billions of events/day — trip updates, driver locations, analytics. Kafka's throughput and replay are non-negotiable at Uber's scale.

Shopify

Kafka for order processing

Pub/Sub model: order placed → cart, inventory, shipping, analytics, email services all consume independently.

Discord

Pub/Sub for chat delivery

Message written → published to a channel topic → all connected user sessions subscribed to the channel receive it.

08

Used in problems

News feed uses Kafka for fan-out events. Notification system uses queue (per-channel delivery). Distributed job scheduler uses Kafka as the task bus. WhatsApp uses pub/sub for broadcasting messages to active sessions.

Next up