A social media feed that shows posts from people you follow, ranked algorithmically, updated in near-real-time. The core challenge: one celebrity post needs to reach 10 million followers instantly — without the system melting.
Functional
Users see a personalized feed of posts from followed accounts
Feed is algorithmically ranked, not purely reverse-chronological
Users can like, comment, share posts
Comments support nested replies
Push notifications when followed users post
Non-Functional
Feed load under 200ms p99 latency
99.99% availability — feed never fully down
Handles celebrities with 10M+ followers
Eventual consistency acceptable — 1-2s post propagation fine
Horizontally scalable — no single bottleneck
Spam and bot detection on comments
03
Scale Estimation
Metric | Calculation | Result
--- | --- | ---
Daily Active Users | Twitter/Instagram scale assumption | 500M
Posts/day | 500M × 1 post per 3 days | ~167M
Post write RPS | 167M ÷ 86,400 (peak 2×) | ~1,900 / peak 4K
Feed loads/day | 500M × 5 page loads per user per day | 2.5B
Read RPS | 2.5B ÷ 86,400 (peak 2×) | ~29K / peak 60K
Likes/sec | 500M × 10 likes/day ÷ 86,400 | ~58K / peak 120K
Comments/sec | 500M × 2 comments/day ÷ 86,400 | ~11,500
Total engagement writes/sec | likes + comments + shares | ~87K / peak 180K
Fan-out Redis writes/sec | 1,900 posts × avg 200 followers | ~380K
Redis feed cache size | 500M users × 1.6 KB per feed | ~800 GB
Post storage (5 years) | 167M/day × 500B × 365 × 5 | ~150 TB
Key Insight
The 15:1 read-to-write ratio drives every architectural decision. Optimise aggressively for reads. The 87K engagement writes/second is the real write challenge — Postgres alone cannot handle this. The fan-out problem (380K Redis writes/sec) is why the hybrid strategy exists.
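The arithmetic behind these estimates can be rechecked with a few lines of Python. This is only a back-of-envelope sketch: the function name `estimate` and its dictionary keys are made up for illustration, the 2.5B feed loads/day figure is taken as given from the table, and decimal units (1 GB = 10^6 KB) are assumed.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau: int = 500_000_000) -> dict[str, float]:
    """Recompute the headline numbers from the stated assumptions."""
    posts_per_day = dau / 3                       # 1 post per user every 3 days
    post_rps = posts_per_day / SECONDS_PER_DAY
    feed_loads_per_day = 2_500_000_000            # taken as given from the table
    read_rps = feed_loads_per_day / SECONDS_PER_DAY
    likes_rps = dau * 10 / SECONDS_PER_DAY        # 10 likes per user per day
    comments_rps = dau * 2 / SECONDS_PER_DAY      # 2 comments per user per day
    return {
        "post_rps": post_rps,
        "read_rps": read_rps,
        "read_write_ratio": read_rps / post_rps,
        "likes_rps": likes_rps,
        "comments_rps": comments_rps,
        "fanout_writes_rps": post_rps * 200,      # avg 200 followers per post
        "feed_cache_gb": dau * 1.6 / 1e6,         # 1.6 KB per user, KB -> GB
        "post_storage_tb": posts_per_day * 500 * 365 * 5 / 1e12,  # 500 B per post
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name:>20}: {value:,.0f}")
```

The unrounded fan-out figure comes out slightly above 380K because the table rounds the post rate down to 1,900 before multiplying.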
Write Path
Client → API Server → Posts DB (sync) → Kafka event (async) → Fan-out workers → Redis sorted set per follower. Post creation is fast; fan-out happens in the background.
Read Path
Client → API Server → Redis feed cache (top 200 candidates) → hydrate post content → Re-ranker (10-20ms) → return top 20. Feed load target: under 50ms including re-ranking.
Kafka as shock absorber
Decouples post creation from fan-out. A celebrity post triggers 10M Redis writes — doing this synchronously would make post creation painfully slow. Kafka absorbs the spike; workers consume at their own pace.
Cassandra for engagement
87K engagement writes/second. Cassandra handles 100K+ writes/sec on a modest cluster with linear horizontal scaling. Postgres with this write volume would need 20+ shards — operational nightmare.
06
Deep Dive — The Fan-out Problem
The Core Question
When Beyoncé posts, her 10 million followers all need to see it. When do you update their feeds? At write time (when she posts) or at read time (when each follower opens the app)? Neither works alone. The hybrid approach is the real answer.
Option A
Fan-out on Read
Build each follower's feed when they open the app. Merge posts from all 500 followed accounts. Write cost: cheap (1 DB write). Read cost: 500 DB queries per feed load × 29K read RPS ≈ 14.5M queries/sec, which is catastrophic.
✗ Fails at scale
Option B
Fan-out on Write
Push post ID into every follower's Redis feed immediately. Read: O(1) — one Redis lookup. Write: 10M Redis writes for a celebrity post. Blocks post creation. Write storm problem.
~ Celebrity problem
Option C — Chosen
Hybrid Fan-out
Regular users (<10K followers) → fan-out on write. Cheap to push to small follower lists. Celebrities (>10K) → fan-out on read. At read time, merge the pre-built feed with fresh posts from a handful of celebrity accounts you follow. The merge is tiny (5-10 sources), not catastrophic.
✓ Best of both worlds
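The hybrid rule can be sketched with in-memory dictionaries standing in for the Posts DB and the per-user Redis sorted sets. Everything here is illustrative: the class `HybridFeed` and its method names are hypothetical, and a real system would cache the celebrity flag rather than count followers on every call.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # follower count above which we stop pushing


class HybridFeed:
    def __init__(self, threshold: int = CELEBRITY_THRESHOLD):
        self.threshold = threshold
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.posts_by_author = defaultdict(list)  # author -> [(ts, post_id)]
        self.feeds = defaultdict(dict)            # user -> {post_id: ts} (sorted-set stand-in)

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > self.threshold

    def publish(self, author: str, post_id: str, ts: float) -> int:
        """Store the post; return how many follower feeds were written."""
        self.posts_by_author[author].append((ts, post_id))
        if self.is_celebrity(author):
            return 0                              # fan-out on read: no pushes
        for follower in self.followers[author]:
            self.feeds[follower][post_id] = ts    # fan-out on write
        return len(self.followers[author])

    def read_feed(self, user: str, limit: int = 20) -> list[str]:
        """Merge the pre-built feed with recent posts from followed celebrities."""
        merged = dict(self.feeds[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                for ts, post_id in self.posts_by_author[author][-limit:]:
                    merged[post_id] = ts          # keyed by post_id, so no duplicates
        ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
        return [post_id for post_id, _ in ranked[:limit]]
```

Because the merge is keyed by post ID, an account that crosses the threshold after some of its posts were already pushed does not show up twice in a feed.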
Algorithmic Ranking
Two-Stage Ranking
Stage 1 (write time): Pre-score = unix_timestamp + small engagement boost. Stored as sorted set score. Fast, no ML. Stage 2 (read time): Pull top 200 candidates, run lightweight re-ranker with fresh signals (10-20ms). ML model trains offline only — never scores at query time.
✓ Fast + relevant
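A rough sketch of the two stages follows. The field names, the boost cap, and the decay formula are all assumptions chosen for illustration; the text only specifies "timestamp plus a small engagement boost" for stage 1 and "fresh signals" for stage 2.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: str
    created_at: float            # unix timestamp
    likes_at_write: int = 0      # engagement known when the score was stored
    fresh_likes: int = 0         # engagement observed at read time
    author_affinity: float = 0.0 # hypothetical per-viewer signal


def pre_score(c: Candidate, boost_per_like: float = 1.0, max_boost: float = 3600.0) -> float:
    """Stage 1: timestamp plus a capped engagement boost (the sorted-set score)."""
    return c.created_at + min(c.likes_at_write * boost_per_like, max_boost)


def re_rank(candidates: list[Candidate], now: float, top_k: int = 20, pool: int = 200) -> list[str]:
    """Stage 2: take the top `pool` by pre-score, then reorder with fresh signals."""
    shortlist = sorted(candidates, key=pre_score, reverse=True)[:pool]

    def fresh_score(c: Candidate) -> float:
        age_hours = max(now - c.created_at, 0.0) / 3600.0
        decay = 1.0 / (1.0 + age_hours)           # older posts count for less
        return (1.0 + c.fresh_likes) * (1.0 + c.author_affinity) * decay

    return [c.post_id for c in sorted(shortlist, key=fresh_score, reverse=True)[:top_k]]
```

The point of the split is that stage 1 costs nothing at read time (it is just the stored score), while stage 2 touches at most 200 items per request.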
Sequence: Feed Read Flow (Mermaid.js)
sequenceDiagram
participant C as Client
participant API as API Server
participant Redis as Feed Cache
participant Rank as Re-ranker
participant DB as Posts DB
C->>API: GET /v1/feed
API->>Redis: ZREVRANGE feed:user123 0 199
alt Feed exists (cache hit)
Redis-->>API: 200 post IDs + scores
API->>DB: Hydrate content (batch fetch)
DB-->>API: Post content
API->>Rank: Re-rank top 200 (~15ms)
Rank-->>API: Top 20 ranked posts
API-->>C: 200 OK — feed in ~50ms
else Cache miss (expired or new user)
API->>DB: Rebuild feed from follows + posts
DB-->>API: Posts from followees
API->>Redis: Cache rebuilt feed (TTL 24h+jitter)
API-->>C: 200 OK — feed in ~300ms
end
Request Flow — Step Through
Client · Posts content → API Server · Stateless → Posts DB · Postgres → Kafka · Fan-out queue → Fan-out Worker · Parallel → Redis Feed · Sorted set
11
Key Design Decisions & Tradeoffs
Chosen
Async fan-out via Kafka
Post creation is fast (~10ms). Fan-out happens in background. Kafka provides durability, retry, and backpressure. A celebrity post triggers 10M Redis writes without slowing the API.
✓ Decoupled, durable, scalable
Alternative
Synchronous fan-out
Write to all follower feeds before returning 200 OK. Simple to reason about, but post creation latency scales with follower count. Celebrity posts time out. No retry on partial failure.
✗ Unusable at celebrity scale
Chosen
Redis INCR + Cassandra + Postgres flush
Three-layer engagement writes. Redis absorbs 87K writes/sec. Cassandra stores durable records at scale. Postgres gets batch-flushed aggregates. Each layer does what it's best at.
✓ Handles 180K writes/sec peak
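The idea of absorbing increments in a fast layer and flushing aggregates in batches can be sketched as below. This is an in-memory stand-in, not Redis or Postgres code: `LikeCounterBuffer` plays the role of the Redis INCR layer and `AggregateStore` the role of the batch-flushed aggregate table, and both names are hypothetical. The durable per-event records (the Cassandra layer) are left out.

```python
from collections import Counter
from typing import Callable


class LikeCounterBuffer:
    """Accumulates per-post increments and hands them off in one batch."""

    def __init__(self, flush_sink: Callable[[dict], None]):
        self._pending = Counter()
        self._flush_sink = flush_sink

    def incr(self, post_id: str, delta: int = 1) -> None:
        self._pending[post_id] += delta           # cheap, no durable write yet

    def flush(self) -> int:
        """Send all pending deltas as a single batch; return posts flushed."""
        if not self._pending:
            return 0
        batch, self._pending = dict(self._pending), Counter()
        self._flush_sink(batch)
        return len(batch)


class AggregateStore:
    """Stand-in for the aggregate table that receives periodic flushes."""

    def __init__(self):
        self.totals = Counter()
        self.write_calls = 0

    def apply_batch(self, batch: dict) -> None:
        self.write_calls += 1                     # one write per flush, not per like
        self.totals.update(batch)                 # adds deltas to running totals
```

A thousand likes on one post turn into a single aggregate write per flush interval, which is why hot posts stop being a contention problem for the relational store.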
Alternative
Postgres only for engagement
Simpler. Works fine up to ~20K writes/sec. At 87K sustained, needs 20+ shards. Operationally painful. Index contention on hot posts during viral moments causes cascading slowdowns.
~ Fine at <10K users
Chosen
Materialized path for comments
Single index scan for any subtree. O(1) insert. ORDER BY path returns depth-first tree order, provided path segments are fixed-width (zero-padded) so that string order matches numeric order. No recursive CTEs. Works at celebrity post scale (50K comments/min).
✓ Fast reads + fast writes
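A small sketch of how materialized paths behave, using a sorted Python list in place of a B-tree index on the path column. The helper names `child_path` and `subtree` and the six-digit segment width are assumptions for illustration. Zero-padding matters: without it, "1.10" sorts before "1.2" as a string.

```python
from bisect import bisect_left

SEGMENT_WIDTH = 6  # zero-padded so lexicographic order equals numeric order


def child_path(parent_path: str, comment_id: int) -> str:
    """Build a comment's path from its parent's path: one string concat, no reads."""
    segment = f"{comment_id:0{SEGMENT_WIDTH}d}"
    return f"{parent_path}.{segment}" if parent_path else segment


def subtree(sorted_paths: list[str], root_path: str) -> list[str]:
    """Range scan over sorted paths, analogous to an index scan for a prefix LIKE."""
    prefix = root_path + "."
    start = bisect_left(sorted_paths, root_path)  # jump to the first candidate
    result = []
    for path in sorted_paths[start:]:
        if path == root_path or path.startswith(prefix):
            result.append(path)
        else:
            break                                 # descendants are contiguous
    return result
```

Sorting the paths yields depth-first thread order directly, and fetching any reply subtree is one contiguous range, whatever the nesting depth.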
Alternative
Adjacency list
Simpler schema. But fetching a thread requires recursive CTEs — one DB join per nesting level. Catastrophic on a post with 50K comments and 10 levels of nesting under viral conditions.
✗ Recursive queries fail at scale
12
What Can Go Wrong
Design Principle
Every failure answer has three parts: what breaks, what the user experiences, and how you recover. Not just "add a replica" — but what happens in the gap before the replica promotes, and does data get lost?
⚡
API Server Crash
In-flight requests fail. No data loss because API servers are stateless — no in-memory state that matters. LB health checks detect the dead instance within 5-10 seconds and stop routing to it.
→ Fix: Stateless design + LB health checks every 5s
→ Fix: Min 3 instances across 3 AZs, so one AZ down = no impact
→ Fix: Graceful shutdown: drain in-flight requests over 30s on SIGTERM
🌊
Kafka Lag / Broker Down
Fan-out workers fall behind. Posts are written to the DB but don't appear in follower feeds for minutes. If a broker dies, partition leaders re-elect in seconds. Consumer offsets are committed to Kafka, so if a worker crashes mid-processing, it restarts from the last committed offset. No data loss.
→ Fix: Replication factor 3 on all topics
→ Fix: Alert on consumer lag > 100K messages or growing for > 5 min
→ Fix: Auto-scale fan-out workers based on lag metric
→ Fix: Dead-letter queue for messages failing after 3 retries
🔄
Fan-out Worker Crash Mid-Processing
A worker fans out a celebrity post to 800K of 2M followers, then crashes. 1.2M followers never get the post. Without protection: permanent gap. With Kafka: offset not committed → reprocessed on restart. The 800K writes already done are idempotent (Redis sorted sets deduplicate by member).
→ Fix: Never commit Kafka offset before all writes succeed
→ Fix: Checkpoint progress in Redis for >1M follower fan-outs
→ Fix: Parallel workers with consistent hash bucketing: 2M ÷ 20 workers = 100K writes each, about 10s total if each worker sustains ~10K writes/sec
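The "commit only after all writes succeed" rule, combined with idempotent feed writes, can be simulated without Kafka or Redis. In this sketch a Python list stands in for a topic partition, an integer for the committed offset, and a dict of dicts for the per-follower sorted sets; the function names and the `crash_after` switch are hypothetical test scaffolding.

```python
class CrashMidFanout(Exception):
    """Simulated worker crash partway through a fan-out."""


def fan_out(post_id, score, followers, feeds, crash_after=None):
    """Write post_id into each follower's feed, optionally crashing midway."""
    for i, follower in enumerate(followers):
        if crash_after is not None and i == crash_after:
            raise CrashMidFanout(f"crashed after {i} writes")
        # Keyed by post_id, so repeating a write leaves one entry (like ZADD).
        feeds.setdefault(follower, {})[post_id] = score


def consume(log, committed_offset, followers, feeds, crash_after=None):
    """Process entries from committed_offset; return the new committed offset."""
    offset = committed_offset
    while offset < len(log):
        post_id, score = log[offset]
        try:
            fan_out(post_id, score, followers, feeds, crash_after)
        except CrashMidFanout:
            return offset      # offset not advanced: the message is replayed
        offset += 1            # commit only after every write succeeded
    return offset
```

On restart the worker repeats the writes it had already done, which is harmless because each write is idempotent; the followers it had not reached yet are filled in by the replay.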
🔥
Redis Cluster Down
Feed cache gone. Like/comment counts show 0. Rate limiting stops working (risk: spam through). Session tokens unvalidatable. This is the scariest failure because Redis is in both the hot read and write path.
→ Fix: Graceful fallback: try Redis, on failure rebuild feed from Postgres (~300ms vs ~5ms)
→ Fix: Redis Cluster with 3 primary shards × 1 replica, multi-AZ
→ Fix: AOF persistence with appendfsync everysec, so the worst case is losing the last 1 second of counter data
→ Fix: Rate limiting fallback: fail open with aggressive logging, not fail closed
💾
Postgres Primary Goes Down
All writes fail for 20-30 seconds during failover. Reads continue from replicas. New posts cannot be created. During the gap, API queues writes to Kafka rather than dropping them — when Postgres returns, they land safely.
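→ Fix: Patroni for automatic failover, with PgBouncer repointing connections to the new primary (~20s total)
→ Fix: Synchronous replication for post creation (synchronous_commit = on plus a synchronous standby), so committed posts survive failover
→ Fix: Kafka buffers writes during the 20s gap, so no user-visible data loss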
⚡
Thundering Herd on Cache Expiry
All users' feed caches expire at the same time (24h TTL). Thousands of simultaneous requests hit Postgres to rebuild feeds. DB gets crushed before caches are repopulated. Latency spikes → timeouts → potential cascade.
→ Fix: TTL jitter: base 24h + random(0, 3600s) spreads expiry over 1 hour
→ Fix: Mutex lock on cache rebuild: only one request triggers rebuild per user
→ Fix: Proactive cache warming: background job pre-warms feeds for users with last_active_at > 20h
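The first two fixes can be sketched in a few lines. This is a single-process illustration with assumed names (`jittered_ttl`, `SingleFlightCache`): the cache here never actually expires entries, and in a real deployment the jittered value would be passed as the expiry when the rebuilt feed is written to Redis, with the lock held in a shared store rather than in process memory.

```python
import random
import threading

BASE_TTL_S = 24 * 3600
MAX_JITTER_S = 3600


def jittered_ttl(rng=random) -> int:
    """Base 24h plus up to 1h of random jitter, so expiries spread out."""
    return BASE_TTL_S + rng.randint(0, MAX_JITTER_S)


class SingleFlightCache:
    """On a miss, only one caller per key runs the expensive rebuild."""

    def __init__(self, rebuild):
        self._rebuild = rebuild
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()            # protects the lock table

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:
            return self._values[key]              # fast path: cache hit
        with self._lock_for(key):
            if key not in self._values:           # re-check after acquiring the lock
                self._values[key] = self._rebuild(key)
            return self._values[key]
```

Concurrent requests for the same user wait on the per-user lock and then read the value the first request produced, so the database sees one rebuild instead of many.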
🌡️
Hot Shard (Celebrity Problem)
Sharding by user_id means Beyoncé's posts all land on one shard. With 10M+ followers reading her posts, that shard gets 10-100× the traffic of any other shard. Standard sharding doesn't help: the problem is data locality, not distribution.
→ Fix: Celebrity-dedicated shards: accounts >10K followers get their own shard
→ Fix: Post content cache in Redis, so celebrity posts almost never hit Postgres directly
→ Fix: Extra read replicas for hot shards (5 replicas instead of 2)
🏔️
Cascading Failure — the Most Dangerous
Slow Postgres → API servers wait longer → thread pools fill up → new requests queue → memory fills → API starts timing out all requests including ones that don't need the DB. One slow component takes down the entire service.
→ Fix: Circuit breaker: CLOSED → OPEN (50% failure rate in 10s) → HALF-OPEN (30s) → CLOSED
→ Fix: Bulkheads: separate thread pools for feed reads, post writes, fan-out
→ Fix: Explicit timeouts: API→Redis 50ms, API→DB 500ms, API→Kafka 200ms
→ Fix: When circuit opens, serve stale feed from Redis rather than 503
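The breaker's state machine can be sketched as follows, using the thresholds from the fix list (50% failures in a 10s window, 30s cooldown). The class and parameter names are assumptions, as is the `min_calls` guard that keeps a single early failure from tripping the breaker; the injectable `clock` exists only to make the behaviour testable.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=10.0, cooldown_s=30.0,
                 min_calls=10, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls
        self._clock = clock
        self.state = "CLOSED"
        self._events = deque()                    # (timestamp, succeeded)
        self._opened_at = 0.0

    def call(self, fn, fallback):
        """Run fn unless the breaker is open; use fallback on open or failure."""
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.cooldown_s:
                return fallback()                 # fail fast, e.g. serve a stale feed
            self.state = "HALF_OPEN"              # cooldown over: allow one probe
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success(now)
        return result

    def _on_success(self, now):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"                 # probe succeeded
            self._events.clear()
        self._events.append((now, True))

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)                       # probe failed: reopen
            return
        self._events.append((now, False))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()                # drop events outside the window
        failures = sum(1 for _, ok in self._events if not ok)
        if (len(self._events) >= self.min_calls
                and failures / len(self._events) >= self.failure_ratio):
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
        self._events.clear()
```

While the breaker is open the protected call is never attempted, which is what keeps API threads from piling up behind a slow database.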
Failure | User Impact | Recovery Time | Key Mechanism
--- | --- | --- | ---
API server crash | Brief errors on in-flight requests | < 10s | Stateless + LB health checks
Kafka lag | Feed delay, not data loss | Minutes | Consumer auto-scaling
Fan-out worker crash | Partial feed, auto-retried | Seconds | Kafka offset replay
Redis cluster down | Slower feeds (DB fallback) | < 30s | Replica promotion + fallback code
Postgres primary down | 20-30s write outage | 20-30s | Patroni failover + Kafka buffer
Thundering herd | Slow feeds during spike | Minutes | TTL jitter + mutex lock
Hot shard | Degraded for celebrity content | Ongoing | Dedicated shards + content cache
Cascading failure | Full outage if unmitigated | Minutes to hours | Circuit breaker + bulkheads
⚠
Anti-patterns
🚫
Scan all posts, filter by follow graph, sort by time
O(N × following) per feed load. Dies past ~1M users.
✓ Better: Pre-computed sorted-set feed per user; fan-out-on-write for normal users.
🚫
Elasticsearch as primary feed store
Great for search; suboptimal for infinite-scroll + real-time inserts.
✓ Better: Redis sorted sets for hot feed; ES for search + analytics.
🚫
Re-rank every feed request from scratch
Ranking 5000 candidates per load × 100M DAU is compute-bound.
Lead with the fan-out question before drawing anything. The first thing you say after "I'll design a news feed" should be: "The core tension is the fan-out problem — when a celebrity with 10M followers posts, when do I update their feeds?" Then answer it. This frames the entire design and signals you know what makes this problem hard.
02
The hybrid strategy is not optional at FAANG scale. Pure fan-out on read fails (500 queries per feed load). Pure fan-out on write fails (celebrity problem). Hybrid is the only correct answer. Name the threshold explicitly: >10K followers = celebrity = fan-out on read. Interviewers will push you on where you draw the line.
03
Engagement writes are the trap most candidates miss. If you say "write likes to Postgres" in a 500M DAU system, you've already lost the interviewer. The number — 87,000 writes/second — is what justifies Redis counters + Cassandra. Always derive the number before proposing the solution. "87K writes/second exceeds Postgres's single-node limit by 4×, so here's my three-layer approach…"
04
Name the composite index explicitly. Say: (user_id, created_at DESC) is the most important index in this system. Explain why: celebrity fetches, feed rebuilds, and fan-out lookups all use it. Without it, every such query is a full table scan. Interviewers who care about databases will nod and move on — this is the signal that you understand the data access patterns.
05
Always address the follow graph sharding problem unprompted. If you shard the follows table, you can't efficiently serve both "who follows me?" and "who do I follow?" from the same shard. The answer — store it twice — shows you understand write amplification as a conscious tradeoff. Most candidates either ignore this or give a vague answer about "using a graph database."
06
For failure scenarios: three parts every time. What breaks → what the user experiences → how you recover. Don't just say "add a read replica." Say: "During the 30-second Postgres failover, API servers queue writes to Kafka instead of failing them. The user gets an optimistic 200 OK. By the time they refresh, Postgres is back and the post is durably stored." That's a complete answer.
07
On comments: say "materialized path" immediately. Most candidates say "adjacency list" or "just store parent_id." Materialized path — where you store the full ancestry as a dot-separated string — gives O(1) writes and a single LIKE index scan for any subtree. Name it, describe it in one sentence, and move on. It signals you've thought about tree storage beyond the toy example.
08
Know when NOT to use this stack. The Kafka + Flink + Cassandra + Redis architecture is justified at 87K writes/second. At 500 writes/second, it's massive over-engineering. If an interviewer asks "what would you do differently at 10K users?", say: "Synchronous Postgres writes, no Kafka, no Cassandra. This stack adds 6 failure points — you don't need it until you've hit Postgres's limits." Shows maturity and pragmatism.
09
The cascading failure question is almost always asked. Prepare: "If Postgres slows down, API thread pools fill up and the whole service degrades — even endpoints that don't touch the DB. The solution is a circuit breaker: if the failure rate exceeds 50% in 10 seconds, trip the breaker and serve stale feeds from Redis rather than timing out everything." This answer shows systems thinking, not just component knowledge.
10
For the OnlyFans / paid content follow-up: flip the model. "Instead of fan-out, think content broadcast. Pre-warm the CDN before firing push notifications. 500K subscribers hit the CDN — every read is a cache hit, origin DB never touched. Separate access control via signed URLs so auth doesn't block content delivery." This three-beat answer covers the thundering herd, CDN strategy, and access control in under 30 seconds.
14
Similar Problems
These problems share the same patterns. Mastering the news feed gives you a head start on all of them.
One server, one database, reverse-chronological feed built at read time. No cache, no queue, no fan-out service. Ship fast — premature optimisation kills startups. Feed is slow but the product can be validated.
Phase 2 — 10K to 500K users
Separate DB + Redis Cache
Move DB to its own server. Add Redis in front for hot reads. CDN for media. Read replica on Postgres. Start pre-building feeds for active users in Redis. Most products live here their entire lives.
Phase 3 — 500K to 10M users
Async Fan-out + Sharding
Add Kafka for async fan-out. Shard Postgres by user_id. Introduce hybrid fan-out (celebrities vs regular users). Engagement writes move to Redis counters + periodic Postgres flush. First detection of hot shards.
Phase 4 — 10M to 100M users
Full Pipeline + Cassandra
Engagement records move to Cassandra. Add Flink for stream processing (dedup, trending, counter flush). Algorithmic ranking introduced — two-stage scoring. Follow graph moves to Redis sets. Celebrity-dedicated shards. CDN pre-warming for high-fan-count creators.
Phase 5 — 100M+ users
Multi-Region + Custom Infrastructure
Multi-region active-active with global load balancing. Regional feed caches. ML ranking model updated hourly. Separate microservices for every major component. Most engineers never operate here — but knowing it shows architectural depth.