System Design — 09

Notification System

Design a scalable, multi-channel notification platform that reliably delivers OTPs in 2 seconds and promo blasts to 50 million users — without one starving the other.

Fan-outMulti-ChannelPriority QueuesDelivery GuaranteesKafka
01

Problem Statement

Every major app — Uber, Amazon, WhatsApp, YouTube — has a notification system behind it. But the notification system is rarely the product itself. It's infrastructure that every other team depends on. The payments team needs it for OTPs. Marketing needs it for promos. Social needs it for "X liked your post."

Think of it like a postal service inside your company. Any team can hand you a letter (notification request), and your job is to figure out: who gets it, through which channel, how urgently, and to confirm it was delivered.

Core question: How do you build a single system that handles a 2-second OTP delivery and a 50-million-user promo blast with the same infrastructure, without one starving the other?

Real-world context: Facebook sends billions of notifications daily. Uber's OTP and ride-status notifications are on the critical path — if they fail, the business fails. Amazon's "your package shipped" emails are high-volume but tolerant of a few minutes' delay. All three coexist in a single notification platform.

02

Requirements

Functional Requirements

  • Multi-channel delivery — Push (FCM/APNs), SMS (Twilio), Email (SES/SendGrid), In-app (stored feed)
  • Priority levels — Critical (OTP), High (ride update), Medium (social), Low (promos)
  • User preferences — Per-channel opt-in/out, quiet hours, frequency caps
  • Template system — Reusable templates with variable substitution: Hi {{name}}, your order {{id}} shipped
  • Delivery tracking — Per-notification lifecycle: created → queued → sent → delivered → read → failed
  • Scheduling — Future-dated and recurring notifications (weekly digest every Sunday 9 AM)
  • Bulk fan-out — One event triggers notifications to millions (e.g., all Premium users in a region)
  • In-app notification feed — Paginated, chronologically ordered list per user

Non-Functional Requirements

  • High availability — Tier-0 infrastructure. If notifications go down, OTPs fail, payments stall, rides can't start.
  • At-least-once delivery — We'd rather send a duplicate than lose an OTP. Clients handle dedup.
  • Latency-tiered — Critical: <3 seconds. High: <30s. Low: minutes are fine.
  • Massive throughput — Millions of notifications per minute during peak events.
  • Backpressure-aware — FCM, APNs, Twilio, SES all have rate limits. Throttle per-provider without blocking urgent traffic.
  • Ordering for in-app — In-app feed chronologically ordered per user. Cross-channel ordering doesn't matter.

Key NFR Tensions

Latency vs Throughput

A 10M-user promo blast and a single OTP hit the same system. If promo messages flood the queue, the OTP waits. Forces us toward priority queues.

Reliability vs Cost

SMS is reliable but costs $0.01–0.05/msg. Push is free but requires app install. Channel selection becomes a routing problem.

Consistency vs Availability

Preference updates need strong consistency. Sending a promo after opt-out is a GDPR violation.

Simplicity vs Flexibility

Multi-channel fallback, conditional logic, A/B testing on copy — each addition makes the routing layer more complex.

03

Scale Estimation

Starting assumptions: 500M registered users, 100M DAU (20% daily active), 6 notifications per DAU per day across transactional, social, and marketing.

600M
Notifications / Day
1.16B
Delivery Attempts / Day
~35K/s
Peak Creation QPS
~67K/s
Peak Delivery QPS
~1 KB
Per Notification
~600 GB
Daily Storage
54 TB
Active Feed (90 days)
~4.2 TB
Daily Outbound Bandwidth

Channel Breakdown

Each notification may hit multiple channels. In-app: 100% (600M). Push: 70% (420M). Email: 20% (120M). SMS: 3% (18M, critical only). Total delivery attempts: ~1.16B/day.

What the Numbers Force

35K/s peak → async queue-based architecture (can't do sync). 54 TB feed → Cassandra/DynamoDB, not PostgreSQL. 50M bulk sends → separate fan-out pipeline with throttling. Provider rate limits → per-channel worker pools with independent rate limiters.

04

API Design

Two distinct consumer sets: internal services (send notifications, service-to-service auth) and end users (read feed, manage preferences, user JWT auth).

Send Single Notification — Internal
POST /v1/notifications
Headers: X-Service-Id, X-Api-Key, X-Idempotency-Key

{
  "user_id": "user_abc123",
  "priority": "critical",
  "channels": ["push", "sms"],
  "fallback_chain": true,
  "fallback_timeout_sec": 30,
  "template_id": "otp_verification",
  "template_vars": { "otp_code": "482901", "expiry_minutes": "5" },
  "metadata": { "deep_link": "app://verify?code=482901" },
  "ttl_sec": 300
}

→ 202 Accepted { "notification_id": "notif_7f8a9b2c", "status": "queued" }

Why 202 not 201: We're not delivering synchronously. The request is accepted into the queue. Idempotency key prevents duplicate OTPs on retry. Fallback chain: try push first, SMS after 30s if push fails. TTL: expired OTPs are dropped from queue.

Send Bulk Notification — Internal
POST /v1/notifications/bulk

{
  "segment": { "type": "query", "filters": { "subscription": "premium", "country": "AE" } },
  "priority": "low",
  "channels": ["push", "email"],
  "template_id": "promo_sale_v2",
  "template_vars": { "sale_name": "January Mega Sale", "discount": "30%" },
  "scheduling": {
    "send_at": "2025-01-20T09:00:00Z",
    "timezone_aware": true,
    "throttle_per_sec": 10000
  },
  "respect_preferences": true,
  "dedup_key": "uae-sale-jan2025"
}

→ 202 Accepted { "bulk_id": "bulk_3e4f", "estimated_recipients": 1200000 }
Get Notification Feed — User
GET /v1/users/me/notifications?cursor=notif_abc&limit=20&unread_only=false
Authorization: Bearer <jwt>

→ 200 OK {
  "notifications": [ { "notification_id", "title", "body", "deep_link",
                        "priority", "is_read", "created_at" }, ... ],
  "cursor": "notif_6a5b4c",
  "has_more": true,
  "unread_count": 7
}
User Preferences
GET /PUT /v1/users/me/notification-preferences

{
  "categories": {
    "transactional": { "push": true, "email": true, "sms": true, "mutable": false },
    "marketing":     { "push": true, "email": true, "sms": false, "mutable": true }
  },
  "quiet_hours": { "enabled": true, "start": "23:00", "end": "07:00",
                   "timezone": "Asia/Dubai", "override_for_critical": true },
  "frequency_caps": { "marketing_push_per_day": 3 }
}
EndpointMethodConsumerPurpose
/v1/notificationsPOSTInternalSend single notification
/v1/notifications/bulkPOSTInternalSend to segment
/v1/users/me/notificationsGETUserRead feed (cursor-paginated)
/v1/users/me/notifications/readPATCHUserMark as read
/v1/users/me/notification-preferencesGET/PUTUserManage preferences
/v1/templatesCRUDInternalTemplate management
Webhook callbackPOSTSystem→ServiceDelivery status updates
05

High-Level Architecture

The notification system is a pipeline: a notification enters at one end and exits through a provider at the other. Everything in between is routing, filtering, prioritizing, and tracking.

Internal Services Payments, Social, Mktg Ingestion API Validate + Dedup HTTPS 🔴 Critical Queue 🟠 High Queue 🟡 Medium Queue ⚪ Low Queue Kafka Fan-Out Service Bulk → Individual Bulk jobs Processing Workers Template • Prefs • Route Preferences DB PostgreSQL (strong) Push Workers FCM / APNs Email Workers SES / SendGrid SMS Workers Twilio In-App Workers Feed + WebSocket Channel Queues Feed Store Cassandra (54 TB) Cache + Counters Redis Device Registry Tokens + Metadata Delivery Tracker Status + Webhooks WebSocket GW Real-time push Scheduler Cron + Delayed
Request Flow — Step Through
Payments SvcIngestion APIRedis DedupKafka CriticalProcessingPush WorkerFCMDelivery Tracker
Click Next Step to walk through the request flow.
06

Deep Dive — Fan-Out, Push vs Pull & Delivery Guarantees

Fan-Out: Staged Bulk Sends

The naive approach — enqueuing 50M messages at once — floods queues, blows memory during segment resolution, and causes a downstream provider stampede. The production approach uses three stages:

Stage 1 — Job Creation (instant): Write a single bulk_job row to the database, return 202.
Stage 2 — Segment Resolution (streaming): Fan-out service streams users in 5K batches via cursor, pre-filters by preferences (eliminates ~20% before queuing), throttles enqueue at caller-specified rate, and checkpoints progress for crash recovery.
Stage 3 — Delivery: Individual notifications flow through the normal pipeline at a steady, manageable rate.

Push vs Pull for In-App

Pull: User opens feed → query Cassandra (partition key = user_id, cluster key = created_at DESC). Single-partition read, ~2ms. Redis caches page 1 (last 50 notifications) for ~90% cache hit rate — 60× cheaper than storing the full feed in Redis RAM.

Push: Real-time badge updates via WebSocket Gateway. Connection registry in Redis maps user_id → gateway instance. On new notification, publish to the specific gateway holding the user's connection. At 10M concurrent connections, ~25 GB RAM per gateway instance across 20 pods.

Delivery Guarantees: At-Least-Once + Idempotency

Exactly-once across a network boundary with external providers is impossible (Two Generals' Problem). We always retry, with idempotency at 4 layers:

sequenceDiagram participant PS as Payments Service participant API as Ingestion API participant RD as Redis (Dedup) participant KF as Kafka Critical participant PW as Processing Worker participant PU as Push Worker participant FCM as FCM (Google) participant DT as Delivery Tracker participant SMS as SMS Worker PS->>API: POST /v1/notifications (OTP) API->>RD: Check idempotency key RD-->>API: Not seen ✓ API->>KF: Enqueue to notifications.critical API-->>PS: 202 Accepted KF->>PW: Consume message PW->>RD: Check processed set (SETNX) RD-->>PW: New ✓ PW->>PW: Resolve template + check prefs PW->>KF: Enqueue to delivery.push KF->>PU: Consume push message PU->>RD: Check delivery log RD-->>PU: Not sent ✓ PU->>FCM: Send push notification FCM-->>PU: 200 OK PU->>DT: Status → "sent" Note over DT: Start 30s fallback timer FCM-->>DT: Delivery receipt DT->>DT: Status → "delivered" DT->>DT: Cancel SMS fallback ✓ DT->>PS: Webhook: OTP delivered via push Note over DT,SMS: If push fails or times out after 30s: DT->>SMS: Trigger SMS fallback SMS->>SMS: Deliver via Twilio

Layer 1 — Ingestion

Redis SETNX on idempotency key (24h TTL). Duplicate requests return same notification_id without re-queuing.

Layer 2 — Processing

Atomic SETNX on processed set. Kafka consumer commits offset after processing, not before.

Layer 3 — Delivery

Check delivery log before calling provider. FCM collapse_key reduces visible duplicates on device.

Layer 4 — Client

Client maintains local set of recent notification_ids (~200). Last line of defense against server-side dedup failures.

Retry Strategy

Exponential backoff with retry Kafka topics: delivery.push.retry.1 (2s), .retry.2 (8s), .retry.3 (32s), .dlq (dead letter). Critical: 6 retries (~10 min window). Low: 2 retries (~10s), then silent drop.

07

Key Design Decisions & Tradeoffs

Queue Architecture

✓ Chosen

Separate Kafka Topics per Priority

4 topics with dedicated consumer pools. Critical consumers are never blocked by low-priority backlog. 60/25/10/5 worker allocation.

✗ Alternative

Single Topic with Priority Field

Kafka is FIFO per partition — no skip-ahead. A 50M promo blast blocks OTPs. Would require building an in-app priority queue on top of Kafka.

Feed Store

✓ Chosen

Cassandra

Single-partition reads at 54 TB. Native TTL for 90-day expiry. ~$5K/month. Partition key = user_id, clustering key = created_at DESC.

✗ Alternative

Redis Sorted Sets

~10.8 TB in RAM at ~$324K/month. Great as a page-1 cache (we use it for that), but unsustainable as the primary 90-day store.

Real-Time Updates

✓ Chosen

WebSocket + Pull Fallback

<1s badge updates for online users. Connection registry in Redis for targeted routing. 20 gateway pods, ~25 GB RAM each.

✗ Alternative

Push Notification Piggyback

Zero extra infrastructure. Client updates badge on push receipt. But 5–30s delay due to OS throttling. Fine for most apps, not for ride-status-critical ones.

Delivery Guarantee

✓ Chosen

At-Least-Once + Idempotency

Always retry on failure. 4-layer dedup (ingestion, processing, delivery, client). Duplicate rate <0.1%. Can't lose OTPs.

✗ Alternative

Exactly-Once

Impossible across network boundaries with external providers (FCM, Twilio). Only works within Kafka's own transaction boundary.

API Model

✓ Chosen

Async (202 Accepted)

Decouple acceptance from delivery. Ingestion API stays fast (<50ms p99). Provider outages don't cascade to ingestion unavailability.

✗ Alternative

Synchronous (200 with result)

At 7K/sec, blocking until FCM responds requires 14K+ concurrent threads. Provider outage cascades to all channels. Only viable at <100 notifications/sec.

Template Resolution

✓ Chosen

Write Time (Eager)

Resolve templates at processing time. Store full rendered text. No read-path dependency on template service. Simple.

✗ Alternative

Read Time (Lazy)

4× less storage per record. Allows retroactive template fixes. But adds latency to every feed read and creates hard dependency on template service for reads.

Preference Check Location

✓ Chosen

Coarse at Fan-Out + Fine at Processing

Coarse check eliminates ~20% of bulk messages before queueing (saves 10M Kafka messages on a 50M blast). Fine check catches quiet hours, frequency caps, channel prefs.

✗ Alternative

Processing Only

Simpler (one check location). But wastes queue capacity and worker resources on notifications we'll ultimately drop.

08

What Can Go Wrong

🔴 Provider Outage (FCM / Twilio / SES)

Impact: One channel completely blocked. Retry storms fill retry topics, new messages pile up. Mitigation: Circuit breaker pattern — after 50 failures in 60s, stop sending to that provider. Critical traffic routes to fallback channel. Non-critical held in deferred topic until recovery.

🔴 Thundering Herd from Bulk Send

Impact: 50M promo blast + organic traffic overwhelms push workers. FCM returns 429s. Mitigation: Physical isolation — separate worker pools for critical vs low priority with independent FCM API quotas. Adaptive throttling on fan-out based on consumer lag.

🟠 Preference Desync — Send After Opt-Out

Impact: User opts out of marketing at 14:00:01, cached preference is stale, marketing push sent at 14:00:03. GDPR violation risk. Mitigation: Cache invalidation on write. Marketing notifications bypass cache — synchronous PostgreSQL read for compliance.

🟠 SMS Cost Explosion

Impact: Delayed FCM delivery receipts trigger SMS fallbacks unnecessarily. 1% false fallback × 420M pushes = 4.2M wasted SMS = $126K/day. Mitigation: Conservative fallback timeouts per priority (30s critical, 2min high, none for medium/low). Re-check push delivery status before sending SMS.

🟠 Device Token Rot

Impact: 10–30% of stored tokens become stale over time. 84M wasted FCM calls/day. Mitigation: Reactive cleanup on "NotRegistered" response. Weekly proactive health check with FCM dry_run. Token refresh on every app open.

🟡 Kafka Broker Failure

Impact: Writes fail for affected partitions. Mitigation: Replication factor = 3, min.insync.replicas = 2. Leadership transfer in ~30s. Redis fallback buffer for critical notifications during brief outages.

🟡 Duplicate Notifications Despite Idempotency

Impact: Worker crashes between processing and Redis write → reprocesses on restart. Mitigation: Atomic SETNX instead of check-then-set. If Redis is down, fail open (accept brief duplicate spike). Client-side dedup as ultimate safety net.

🟡 Queue Starvation (Low Priority Never Processed)

Impact: Marketing campaigns that should deliver in 1 hour take 12 hours. Mitigation: Dynamic worker rebalancing based on consumer lag. Minimum 2 workers per priority level always. SLA alerts per priority topic.

Anti-patterns

🚫
Send every notification to every device immediately

10M notifications/sec ingest; APNs/FCM rate-limits throttle you.

✓ Better: Queue + batch per-device-token; respect provider rate limits; dedup collapsed notifications.
🚫
Send SMS for every in-app notification

Cost explodes; users unsubscribe.

✓ Better: Channel preferences per user; in-app default; SMS only for critical (payment, 2FA).
🚫
Write notification to DB synchronously before acking API

Notification service becomes part of the hot path.

✓ Better: Fire-and-forget event to Kafka; notify service consumes + delivers async.
09

Interview Tips

💡
Scope first — ask 5 questions
What channels? Platform or feature? Scale (DAU)? Priority levels? Delivery guarantees? This narrows a dozen-direction problem into a specific design. Proposing numbers yourself ("Let me assume 100M DAU") signals confidence.
Draw the pipeline, not the boxes
Start with the data flow: Ingestion → Priority Queues → Processing → Channel Queues → Providers. This gives the interviewer the full picture in 60 seconds. Then add detail where they want to go deep.
🎯
Lead with the OTP-vs-promo tension
This is the defining constraint. Explain priority isolation (separate Kafka topics, dedicated consumer pools). This single point demonstrates you understand both the business requirement and the technical solution.
🚫
Never say "exactly-once delivery"
It's impossible across a network boundary with external providers. Say "at-least-once with idempotency at every layer" and explain the 4-layer dedup strategy. This signals real distributed systems understanding.
🏗️
Proactively discuss failures
Don't wait for "what could go wrong?" Bring up provider outages and circuit breakers yourself. Mention the thundering herd from bulk sends. This signals senior-level thinking that goes beyond the happy path.
📐
45-minute structure
Scoping (5 min) → Scale estimation (3 min) → Architecture pipeline (10 min) → API design (5 min) → Deep dive on ONE topic for 15 min (priority queues OR delivery guarantees OR fan-out) → Tradeoffs + failures (5 min) → Q&A (2 min).
11

Evolution

How this design grows from MVP to planet-scale. Architecture is derived from scale, not pattern-matched from blog posts.

1

MVP — Single Server, One Channel (10K users)

Flask + PostgreSQL + Celery + FCM. One channel, one priority, one server. Preferences are a boolean column. Templates are hardcoded strings. Deployable in a day. Breaks at ~1M notifications/day when the single Celery worker becomes a bottleneck.

2

Multi-Channel with Queues (1M DAU)

Add email + SMS channels. Replace Celery with Kafka for durability and replay. Two priority topics (critical, normal). User preference table with per-channel opt-in/out. Separate notification feed table partitioned by month.

3

Platform Scale (50M DAU)

Template CRUD API. 4 priority levels with dedicated consumer pools. Bulk fan-out service with throttling. Cassandra for feed store. Fallback chains with delivery tracker. Webhook system for calling services. WebSocket gateway for real-time updates.

4

Planet Scale (500M DAU)

Multi-region deployment. ML-powered channel selection (predicts best channel per user). Notification aggregation ("Ahmed and 47 others liked your post"). Batch provider APIs. Send-time optimization. Per-team SMS cost governance. GDPR compliance audit trails.

Next up