A real-time messaging system supporting one-to-one and group chat, delivery receipts, and offline message queuing at 2 billion user scale. The core challenge: maintaining 50 million persistent WebSocket connections while routing messages across hundreds of stateful servers in under 100ms.
Media sending — object storage + URL in message body
Typing indicators and online / last seen presence
Multi-device sync
E2EE — Acknowledge Deeply
Signal Protocol — per-message key derivation
Server is a blind relay — forwards ciphertext it cannot read
Keys stored only on devices, never on server
03
Scale Estimation
Metric
Calculation
Result
Total / Daily Active Users
Industry figures
2B / 500M DAU
Messages per day
Industry figure
100B
Average write RPS
100B ÷ 86,400
~1.16M/sec
Peak write RPS
1.16M × 3 (peak multiplier)
~3.5M/sec
Avg message size
Text + metadata
~500 bytes
Storage per day
1.16M × 500B × 86,400
~50 TB/day
Concurrent WebSocket connections
10% of DAU online simultaneously
~50M
Chat Server Sizing
Memory per WebSocket connection: ~100 KB. A 32 GB server at 65% utilisation holds
~200K connections. To serve 50M concurrent connections:
50M ÷ 200K = ~250 chat servers. Provision
300–350 for failover headroom and traffic spikes.
The Write Amplification Trap
One message in a 256-person group generates 255 delivery receipts + 255 read receipts — ~510 additional writes. At scale, the actual write load is 3–5× the raw message count. Always account for this in your estimates.
04
API Design
The primary API is WebSocket-based for real-time messaging. REST endpoints handle auxiliary operations — fetching history, managing groups, registering devices.
Persistent bidirectional connections. The LB uses sticky sessions — consistent hashing on user ID ensures Alice's frames always reach Server A. If Server A dies, Alice reconnects with exponential backoff + jitter.
Redis — Connection Registry
Maps user_id → server_id with a 60-second TTL refreshed by heartbeat. The most underrated component in this design. Without it, cross-server message routing is impossible. If Redis goes down, delivery degrades gracefully to the offline queue.
Kafka — Message Pipeline
Three jobs: (1) durable buffer so Alice gets ✓ in <1ms before DB write, (2) fan-out for group messages — one write triggers N deliveries, (3) carries receipt events back upstream from recipient to sender.
Cassandra — Message Storage
Partitioned by conversation_id, clustered by timestamp DESC. Chosen over PostgreSQL for linear write scalability — append-only write path (memtable + commit log) handles 1.2M writes/sec without manual sharding.
06
Deep Dive — WebSockets & Real-Time Routing
The Core Problem
HTTP is request-response — the server can never reach out unprompted. Chat requires the server to push to the client at any moment. WebSockets solve this with a persistent bidirectional connection. But persistent connections are stateful — and stateful connections break standard horizontal scaling.
Why delivery feels instant across geography. The naive mental model is Alice → WhatsApp California → Bob. That would be ~250ms of physics. The real path is Alice → nearest regional DC → private backbone → Bob's regional DC → Bob. Multi-region deployment dramatically shortens the actual travel distance.
More importantly, the WebSocket is already open. HTTP incurs TCP + TLS handshake overhead on every request — 3-4 round trips before any data moves. WebSocket pays this cost once on connect. Every subsequent message is one round trip. At 50ms RTT to the nearest DC, that's 50ms latency for same-region delivery.
Receipt latency breakdown (Bob receives → Alice sees ✓✓): Bob → Server B ~10ms · Server B → Kafka <1ms · Kafka → Receipt Handler ~5ms · Redis lookup <1ms · Handler → Server A ~10ms · Server A → Alice ~10ms = ~37ms total.
Sequence — Message Send & Receipt FlowMermaid.js
sequenceDiagram
participant A as Alice
participant SA as Chat Server A
participant K as Kafka
participant R as Redis
participant C as Cassandra
participant SB as Chat Server B
participant B as Bob
A->>SA: send "hey" (WebSocket frame)
SA->>K: publish message event
K-->>SA: ack < 1ms
SA-->>A: ✓ sent receipt
par Async consumers
K->>C: write to Cassandra
and
K->>R: GET session:bob
R-->>K: "chat-server-B"
K->>SB: deliver to Bob
SB->>B: push message (WebSocket)
B-->>SB: delivered receipt
SB->>K: publish receipt event
K->>R: GET session:alice
R-->>K: "chat-server-A"
K->>SA: route receipt
SA-->>A: ✓✓ grey (delivered)
end
B->>SB: opens chat (read receipt)
SB->>K: publish read event
K->>SA: route read receipt
SA-->>A: ✓✓ blue (read)
The Connection Registry — Most Missed Component
When Bob connects, Chat Server B writes: SET session:bob → "chat-server-B" EX 60. Every message delivery does GET session:bob first. The 60-second TTL means if Server B crashes, Redis auto-cleans Bob's stale entry — no manual cleanup needed. This is the piece most YouTube tutorials skip entirely.
Request Flow — Step Through
Alice · Mobile→Chat Server A · WebSocket→Kafka · Pipeline→Redis Lookup · session:bob→Chat Server B · Bob's server→Bob · Mobile
Click Next Step to walk through the request flow.
08
Key Design Decisions & Tradeoffs
Chosen
WebSockets
Persistent bidirectional connection. One-time handshake cost. Idle connections cost only memory (~100KB). 50M connections = ~250 servers. Server can push at any time.
✓ Necessary at this scale
Rejected
HTTP Long Polling
Works. Used by early chat systems. But every poll is a full HTTP request — TCP + TLS overhead. At 50M users polling every 3 seconds: ~17M requests/sec just to check for messages. Unsustainable.
✗ Too expensive at scale
Chosen
Cassandra
Linear horizontal scaling. Append-only write path (memtable + commit log). Access pattern is always "last N messages in conversation X" — perfectly fits partition-by-conversation model. Handles 1.2M writes/sec natively.
✓ Justified by write characteristics
Rejected for messages
PostgreSQL
ACID, joins, complex queries. But primary-node writes become the bottleneck. Manual sharding by conversation_id is operationally painful. No advantage here since we never need joins on the message read path.
~ Right for users / metadata tables
Chosen
At-Least-Once Delivery
Kafka publishes, DB consumer and delivery consumer run independently. If delivery consumer crashes post-delivery but pre-ack, message is replayed. Client deduplicates via message_id. Duplicates are rare and invisible.
✓ Industry standard approach
Rejected
Exactly-Once Delivery
Requires distributed transactions and two-phase commits across Kafka + DB + delivery layer. Adds significant latency overhead. At 1.2M writes/sec the cost is prohibitive. The benefit (no duplicates) is solvable client-side for free.
✗ Too expensive for marginal gain
Chosen
Availability over Consistency
Messages queue when downstream components degrade. Kafka accepts writes even if Cassandra is slow. Delivery router queues offline even if Redis is flaky. System never rejects a send — it just delivers later.
✓ Users tolerate delay, not failure
Rejected
Consistency over Availability
Would require all replicas to confirm before acknowledging a send. Under any network partition, sends fail with errors. UX catastrophe — users panic on failed sends, retry storms compound the problem.
✗ Wrong tradeoff for chat
09
What Can Go Wrong
💥
Chat Server Crash — Thundering Herd
Server holding 200K connections crashes. All 200K users attempt reconnection simultaneously. Remaining servers at 65% capacity get slammed — potentially cascading to more crashes. Redis session entries are stale, pointing to dead server.
Recovery: Load balancer detects dead server via health check (every 5-10s), stops routing. Redis TTL (60s) auto-expires stale session entries — no manual cleanup. Clients reconnect with exponential backoff + jitter: 200K reconnects spread over 30-60 seconds instead of simultaneously. Messages missed during outage window delivered from offline queue on reconnect.
→ Fix: Exponential backoff + jitter on client · Redis TTL self-cleanup · offline queue covers message gap
🔴
Redis Goes Down — Routing Blind
Connection registry unavailable. Delivery router can't find recipient's server. All real-time delivery to online users breaks. Crucially: no messages are lost — Kafka and Cassandra are still running. The only broken path is real-time delivery.
Recovery: Delivery router treats all users as offline, queues everything. Redis Sentinel / Redis Cluster promotes replica within ~30 seconds. Real-time delivery resumes. Queued messages flush automatically. Users see a ~30s delay on message delivery — not an error.
→ Fix: Redis Cluster with automatic failover · graceful degradation to offline queue path
📢
Group Fan-out Explosion
One message in a 256-person group where all are online generates: 256 Redis lookups + 256 WebSocket pushes + 256 delivered receipts + 256 receipt deliveries = ~1,000+ operations per message send. At scale with many active groups, the delivery router becomes a bottleneck.
Mitigation: Batch the 256 Redis lookups into a single pipeline call — one round trip instead of 256. WhatsApp's 256-member cap is a deliberate architectural decision, not a product preference — it's the threshold where server-side fan-out remains manageable. For systems with larger groups (Telegram at 200K members), flip to fan-out on read.
→ Fix: Redis pipelining for batch lookups · group size caps · fan-out on read for very large groups
⚡
Kafka Consumer Lag
If a Kafka consumer (message writer, delivery router, receipt handler) falls behind, messages pile up in the Kafka topic. Delivery delays grow. If the consumer crashes entirely, unprocessed messages wait until it recovers or another instance takes over.
Kafka's consumer group model handles this automatically — multiple consumer instances share partitions. If one crashes, its partitions are reassigned to healthy instances within seconds. Messages are not lost because Kafka retains messages until explicitly committed.
→ Fix: Consumer groups with multiple instances · monitor consumer lag · auto-scaling consumers on lag spike
⚠
Anti-patterns
🚫
Relational DB with rows per message
~100B messages/day on a RDBMS is hopeless.
✓ Better: Cassandra (partition by chat_id, clustering by ts); cold data tiered to cheaper storage.
🚫
Always-on server storage for all media
S3 bills at EB scale get expensive.
✓ Better: E2E encrypted media; server holds encrypted blob briefly, expires after recipient fetches.
Lead with why WebSockets, not just that you're using them. Say: "HTTP long polling would generate ~17M requests/second at our scale just to check for new messages. WebSockets are persistent — 50M idle connections costs only ~5TB of RAM across ~250 servers." Numbers win interviews.
02
Name the connection registry immediately. Most candidates skip this. The question "how does Server A know Bob is on Server B?" will always come. Have the answer ready: Redis maps user_id → server_id, 60-second TTL refreshed by heartbeat, auto-expires on crash. This one component differentiates good answers from great ones.
03
Kafka appears in two directions. Interviewers will ask "what is Kafka doing here?" The answer has two parts: (1) message delivery pipeline from sender to recipient, and (2) receipt events flowing back from recipient to sender. Make sure you explain both or your design feels incomplete.
04
The 256 group cap is an architecture decision. When asked why WhatsApp limits group size, don't say "product choice." Say: one group message to 256 online members generates ~1,000 operations. At scale with millions of active groups, the fan-out math dictates the cap. Shows you think in systems, not features.
05
E2EE in one sentence, then move on. "Signal Protocol — keys are exchanged on registration and stored only on devices. The server is a blind relay forwarding ciphertext it cannot read." This shows depth without burning 10 minutes on cryptography that won't be tested.
06
When asked about consistency: say "availability over consistency" and immediately explain what that means concretely — messages queue, not fail. Users tolerate a 300ms delay. They do not tolerate a send failure. The CAP tradeoff should come with a user experience justification, not just the theory.
11
Similar Problems
WhatsApp contains several sub-problems that appear as standalone system design questions. Mastering this one gives you significant head starts on all of them.
Single Node.js server with Socket.io or basic long polling. SQLite or PostgreSQL for messages. No Kafka, no Redis, no separate push service. Ship fast — premature real-time infrastructure kills early-stage products.
Phase 2 — 10K to 500K users
Upgrade to True WebSockets + Add Redis
Migrate to real WebSocket connections. Add Redis for session registry and presence. Separate the database. Add read replicas. Introduce basic push notifications. Most startups live here for years — this handles tens of millions of messages per day comfortably.
Phase 3 — 500K to 50M users
Kafka + Cassandra + Dedicated Chat Servers
Introduce Kafka as the message pipeline. Migrate message storage to Cassandra as write volume overwhelms PostgreSQL. Separate chat servers from REST API servers. Add consumer microservices for delivery routing, receipts, offline queuing. This is the architecture we designed today.
Phase 4 — 50M+ users
Multi-Region + Custom Infrastructure
Deploy to multiple geographic regions — Americas, Europe, Asia-Pacific. Private fibre backbone between DCs for sub-50ms inter-region latency. Global load balancing routes users to nearest region. Cassandra multi-region replication. At WhatsApp scale: ~250 chat servers hold 50M WebSocket connections — surprisingly lean for 2 billion users.