System Design — 016

Reddit-Style Comments

Design a nested, threaded comment system with voting, ranking algorithms, and tree pagination — where the hardest problem isn't storing data, it's retrieving a sorted, truncated tree in under 200ms.

Tree Data StructureRanking AlgorithmsRead-HeavyCache StrategyCursor Pagination
01

Problem Statement

Design a comment system like Reddit's, where users can write top-level comments on posts, reply to any comment (creating arbitrarily deep nested threads), upvote or downvote any comment, and view the comment tree sorted by different ranking algorithms (best, top, new, controversial).

Unlike flat comment systems (YouTube, Instagram), Reddit comments form a rooted, ordered, variable-depth tree. Every node at every level can be independently voted on and ranked. The tree can grow to tens of thousands of comments on viral posts while needing to load in under 200ms.

Core question: How do you store, retrieve, rank, and paginate a tree structure at massive read scale — where every node has a rapidly changing vote score?

What's in scope: Nested comments, voting, ranking, tree pagination, "load more replies", "continue this thread", soft deletion.

What's out of scope: Post system, subreddits, moderation tools, awards/gilding, user profiles, notification delivery internals, real-time WebSocket push.

02

Requirements

Functional Requirements

  • Create comment — top-level on a post, or reply to any existing comment (tree grows deeper)
  • Vote — upvote, downvote, or remove vote on any comment. One vote per user per comment, changeable.
  • View comment tree — fetch a sorted, truncated, nested tree for a post. Multiple sort orders: best, top, new, controversial, old.
  • Load more replies — expand truncated sibling branches. "Load more replies (42)" at any tree node.
  • Continue thread — navigate into deeply nested chains beyond the initial depth cutoff.
  • Edit & delete — author can edit body text or soft-delete (shows "[deleted]", preserves tree structure).

Non-Functional Requirements

  • Read latency < 200ms — comment tree load must feel instant. This drives the entire caching strategy.
  • Eventual consistency on votes — vote counts can lag by up to 60 seconds. Reddit itself fuzzes displayed scores. This is a gift architecturally.
  • Read-your-own-writes for comments — if a user posts a reply and refreshes, they must see it immediately.
  • High availability — reading comments must work even during partial infrastructure failures.
  • Bounded response size — max ~200 comments per response (~80 KB payload), regardless of total tree size.

The requirement that drives everything: Fetch a sorted, truncated, nested tree in under 200ms for a post that might have 50,000 comments.

03

Scale Estimation

Every number derived from assumptions — adjust the inputs and watch the architecture change.

~35K/s
Peak comment reads/sec
~175/s
Peak comment writes/sec
~2,600/s
Peak vote writes/sec
200:1
Read-to-write ratio

Derivation Chain

500M MAU → 50M DAU (10%) → 2% comment → 3M comments/day (~35/s avg, ~175/s peak at 5×).
9% of DAU vote, 10 votes each → 45M votes/day (~520/s avg, ~2,600/s peak).
80% of DAU view comments, 15 pages each → 600M page views/day (~7K/s avg, ~35K/s peak).

Storage

MetricValueInsight
Comment size (row)~500 bytesText + metadata + indexes
Daily comment storage1.5 GB/dayNot a sharding driver
Annual storage~550 GB/yearSingle DB can hold 5 years
Vote record size~50 bytesGrows faster than comments
Response payload~80 KB~200 comments × 400 bytes
Peak bandwidth~2.8 GB/sCDN + cache absorbs this

What the numbers tell us: The system is overwhelmingly read-heavy (200:1). Vote counting is the hottest write path, not comment creation. Storage is a non-issue — we shard for read throughput, not capacity.

04

API Design

Seven endpoints. The complexity isn't in the API surface — it's in the tree assembly logic behind the read endpoints.

Get Comment Tree (the hardest endpoint)
GET /api/v1/posts/{post_id}/comments
    ?sort=best          // best | top | new | controversial | old
    &depth=3            // max nesting depth to return
    &limit=20           // max top-level comments
    &child_limit=5      // max children per comment per level
    &cursor={token}     // opaque pagination token

→ 200 { data: { post_id, total_comments, sort, comments: [
    { id, user: {id, username, avatar_url}, body, depth,
      upvotes, downvotes, score, user_vote,
      child_count, has_more_children, children: [...] }
  ], pagination: { next_cursor, has_more } } }
Create Comment (top-level or reply)
POST /api/v1/posts/{post_id}/comments
Auth: Bearer {token}
Body: { parent_id: "uuid" | null, body: "text" }
→ 201 { data: { id, post_id, parent_id, user, body, depth, ... } }
Load More Replies
GET /api/v1/comments/{comment_id}/replies
    ?sort=best&limit=10&depth=3&cursor={token}
→ 200 { data: { parent_id, replies: [...], pagination: {...} } }
Continue Thread (deep chain)
GET /api/v1/comments/{comment_id}/thread?sort=best&limit=20
→ 200 { data: { ancestor_chain: [...], focus_comment: {...}, pagination } }
Vote on Comment
PUT /api/v1/comments/{comment_id}/vote
Auth: Bearer {token}
Body: { vote: 1 | -1 | 0 }    // upvote | downvote | remove
→ 200 { data: { comment_id, upvotes, downvotes, score, user_vote } }
Edit & Delete
PATCH /api/v1/comments/{id}   → body update, sets edited_at
DELETE /api/v1/comments/{id}  → soft delete: is_deleted=true, body="[deleted]"

Key Design Decisions

Cursor-based pagination

Offset pagination breaks on sorted, dynamic datasets. Cursors encode (wilson_score, id) for stable keyset pagination.

Server owns tree assembly

Client receives a pre-assembled, sorted, truncated nested JSON tree — not a flat list to reconstruct.

PUT for votes (idempotent)

One endpoint handles upvote, downvote, and removal. Upvoting twice = same effect. Simpler than separate endpoints.

Soft delete preserves tree

Hard-deleting a parent orphans all children. Soft delete keeps tree structure intact, showing "[deleted]" placeholder.

05

High-Level Architecture

Every component exists because a specific number demanded it. The 200:1 read-to-write ratio drives the cache-first design. The 2,600 votes/sec hot-key problem drives the Redis counter layer. The 35,000 reads/sec demands pre-assembled tree caching.

Client Web / Mobile CDN CloudFront Load Balancer ALB L7 API Servers ×18 · Stateless Redis Cluster Tree Cache + Votes PostgreSQL 1 Primary + 3 Read Kafka Event Queue Workers Score Calc Cache Inv. PgBouncer Conn Pool HTTPS REST Cache R/W Read/Write Events Consume Batch

Component Summary

ComponentTechWhy It Exists
CDNCloudFrontAbsorb 2.8 GB/s bandwidth, short-TTL cache for comment pages
Load BalancerALB (L7)Distribute across 18 stateless API servers, health checks
API Servers ×18GoHandle tree assembly, caching, auth. Sized for 35K reads/s
Redis Cluster ×6Redis 7Pre-assembled tree cache, vote counters, rate limits, user vote state
PostgreSQLPG 16Durable storage. 1 primary (writes) + 3 replicas (reads)
PgBouncerMultiplex 360 app connections over 100 DB connections
Kafka3-brokerAsync: score recalculation, cache invalidation, notifications
WorkersGoScore calculator (batch votes), cache invalidator, notification sender
Request Flow — Step Through
ClientCDNLoad BalancerAPI ServerRedis CachePostgreSQLAssemble TreeReturn
Click Next Step to walk through the request flow.
06

Deep Dive — Tree Retrieval & Pagination

This is the hardest engineering problem in the system. Flat list pagination is one-dimensional. Tree pagination operates in three dimensions simultaneously: breadth (siblings per level), depth (how many levels), and sort order (different at every level). No off-the-shelf database feature handles this.

Why the Naive Approach Fails

Fetching all 50,000 comments for a viral post and assembling in memory: 25 MB from DB, 40 MB in memory, 20 MB JSON response — but we only show ~200 comments. That's 250× more data than needed.

Our Approach: Top-Down Level-by-Level Fetching

Fetch only the comments we'll render, level by level, using window functions to select the top-N children per parent at each depth.

sequenceDiagram participant C as Client participant A as API Server participant R as Redis Cache participant DB as PostgreSQL C->>A: GET /posts/{id}/comments?sort=best A->>R: GET tree:{post_id}:best:page1 alt Cache HIT (90%) R-->>A: Compressed JSON tree A->>R: HGETALL user_votes:{user}:{post} R-->>A: Vote map A-->>C: Tree + user votes (3-5ms) else Cache MISS (10%) R-->>A: null A->>R: SET lock:tree:{post_id} NX EX 5 A->>DB: Top 20 top-level (wilson_score DESC) DB-->>A: 20 rows A->>DB: Top 5 children per parent (window fn) DB-->>A: ≤100 rows A->>DB: Top 3 grandchildren per parent DB-->>A: ≤300 rows A->>DB: Top 2 great-grandchildren per parent DB-->>A: ≤600 rows A->>A: Assemble tree (~1ms) A->>R: SETEX tree (45s TTL) A-->>C: Tree response (30-85ms) end

The Window Function Query

Each level uses a single batch query with ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY wilson_score DESC) to select the top-N children per parent — avoiding N separate queries.

SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY parent_id
            ORDER BY wilson_score DESC
        ) as rank
    FROM comments
    WHERE parent_id IN ('c1', 'c2', ... 'c20')
) ranked
WHERE rank <= 5;

Fetch Budget Per Level

LevelParentsChildren/ParentMax RowsQueries
0 (top-level)20201
12051001
2≤10033001
3≤30026001
Total≤1,0204

We fetch ≤1,020 comments in 4 queries vs. 50,000 in one — a 50× data reduction at the cost of 3 extra round trips.

Three Types of Cursors

Top-Level Pagination

Keyset cursor: (wilson_score, id) < (0.723, 'c20'). Loads the next 20 root comments with their subtrees.

Load More Siblings

Scoped to one parent_id. Same keyset pattern, fetching the next batch of children for a specific comment.

Continue Thread

Fresh subtree rooted at the target comment. Includes ancestor chain for breadcrumb context (free via materialized path).

Cache Strategy

Cache the generic tree without user_vote. Overlay each user's votes per-request from a separate Redis hash. One cache serves all users.

Data Model: Adjacency List + Materialized Path

We store parent_id for write simplicity (O(1) INSERT) and a path column for read power (prefix-based subtree queries, ancestor lookups without recursion). Best of both worlds — since comments are never re-parented, the path's biggest weakness doesn't apply.

CREATE TABLE comments (
    id              UUID PRIMARY KEY,
    post_id         UUID NOT NULL,
    parent_id       UUID REFERENCES comments(id),
    path            TEXT NOT NULL,        -- "uuid-A.uuid-B.uuid-C"
    user_id         UUID,
    body            TEXT NOT NULL,
    depth           INT NOT NULL DEFAULT 0,
    upvotes         INT NOT NULL DEFAULT 0,
    downvotes       INT NOT NULL DEFAULT 0,
    score           INT NOT NULL DEFAULT 0,
    wilson_score    FLOAT NOT NULL DEFAULT 0,
    controversial_score FLOAT NOT NULL DEFAULT 0,
    direct_child_count  INT NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    is_deleted      BOOLEAN NOT NULL DEFAULT false
);

-- Key indexes
CREATE INDEX idx_post_path ON comments (post_id, path);
CREATE INDEX idx_parent_wilson ON comments (parent_id, wilson_score DESC);
CREATE INDEX idx_post_wilson ON comments (post_id, wilson_score DESC);

CREATE TABLE comment_votes (
    user_id     UUID NOT NULL,
    comment_id  UUID NOT NULL REFERENCES comments(id),
    vote_type   SMALLINT NOT NULL,  -- 1, -1
    PRIMARY KEY (user_id, comment_id)
);

Why not NoSQL? A document store (MongoDB) hits the 16 MB doc limit on viral posts. Updating one comment's score requires read-modify-write on a multi-MB document. PostgreSQL's window functions, CTEs, and row-value comparisons are purpose-built for our tree queries.

07

Key Design Decisions & Tradeoffs

Tree Storage Model

✓ Chosen

Adjacency List + Materialized Path

parent_id for O(1) writes, path column for prefix subtree queries without recursion. Slight write overhead (one extra read for parent's path). Redundant data, but 200:1 read ratio means optimizing reads is always right.

✗ Alternatives

Closure Table / Nested Sets

Closure table: O(depth) rows per insert, storage explodes. Nested sets: every insert renumbers half the table. Both unworkable at 3M comments/day write rate.

Sort Score Computation

✓ Chosen

Precomputed + Async Updated

wilson_score stored as indexed column, updated by batch worker every 500ms. Enables index-driven ORDER BY — no full-table scan. Scores lag by up to 60s (cache TTL), invisible to users.

✗ Alternative

Compute on Read

Calculate Wilson score in SQL at query time. Prevents index usage, forces O(n) computation for every query. At 35K reads/s, requires thousands of DB instances. Works fine under 1K reads/s.

Vote Consistency Model

✓ Chosen

Eventual Consistency

Vote counts lag by seconds. Reddit already fuzzes displayed scores. Enables aggressive caching and Redis-based fast counters. User's own vote is always reflected instantly (optimistic update).

✗ Alternative

Strong Consistency

Every vote immediately UPDATEs the comment row. At 500 votes/sec on a viral comment, row-lock serialization makes the 500th voter wait ~500ms. Kills write throughput and prevents caching.

Caching Strategy

✓ Chosen

Cache Full Pre-Assembled Trees

One Redis GET → decompress → overlay user votes → return. Sub-5ms on cache hit. Invalidation is coarse-grained (entire post tree), but it only happens once per 30-60s TTL.

✗ Alternative

Cache Individual Comments

Surgical invalidation (update one comment without touching others). But requires 2+ Redis operations per read plus client-side assembly. 3-5ms vs sub-1ms. Better for real-time vote displays.

Tree Fetch Strategy

✓ Chosen

Level-by-Level Window Queries

4 sequential queries fetching ≤1,020 rows total. Precise per-level truncation via ROW_NUMBER(). 8-20ms network overhead, but 50× less data than fetching everything.

✗ Alternative

Single Recursive CTE

One query, one round trip. But fetches ALL children (no per-level limits), returns 50K rows for viral posts. Can't push truncation into the recursive step. Only viable for small trees.

Async Processing

✓ Chosen

Kafka Event Queue

Side effects (scores, cache, notifications, karma) processed async. Comment creation takes 50ms instead of 150ms. Events are durable, replayable, and support independent consumer scaling.

✗ Alternative

Synchronous Processing

Simpler — no Kafka, no workers, no consumer lag monitoring. But adds ~100ms to every write. Couples the comment API to every downstream system. Fine for MVP, painful at scale.

08

What Can Go Wrong

Cache Stampede on Viral Posts

Cache TTL expires on a post with 2M viewers → 35K simultaneous cache-miss rebuilds hit the database. Fix: Distributed lock (only one server rebuilds), stale-while-revalidate backup (serve slightly old data while rebuilding), adaptive TTL (shorter for hot posts, longer for cold).

Hot-Key Vote Contention

Top comment on a viral post: 500 votes/sec on one DB row → row-lock serialization, cascading timeouts. Fix: Redis HINCRBY for instant counting (100K ops/sec, no locks), Kafka + batch worker collapses 500 writes/sec into 1 batched UPDATE/sec.

Database Primary Failure

Primary crashes → all writes fail. Reads continue from cache + replicas. Fix: Automated failover with Patroni/pg_auto_failover (30-60s). Kafka retains unprocessed vote events until new primary is ready. Zero vote loss if Kafka has acks=all.

Cascading Timeout Collapse

DB goes from 10ms to 200ms queries → API timeouts → client retries → 3× load → total collapse. Fix: Explicit timeouts at every boundary, circuit breaker on DB connections (stop sending requests during recovery), exponential backoff with jitter on retries, load shedding (reject excess requests with 503).

Cache-Database Inconsistency

Cache invalidation worker lags → user posts comment, refreshes, doesn't see it (stale cache served). Fix: Read-your-own-writes bypass — after writing, set a short Redis flag; next read for that user bypasses cache and queries DB primary directly. TTL (30-60s) self-heals regardless.

Comment Bombs & Deep Chains

Malicious user creates 500-deep chain or 10K spam comments on one post. Fix: Depth limit (50 levels enforced at API), three-layer rate limiting (per-user, per-user-per-post, per-post global), path length cap (2000 chars).

Kafka Consumer Lag

Score calculator can't keep up → Wilson scores become stale → sort order drifts visibly wrong. Fix: Auto-scaling workers based on consumer lag metrics. Priority processing (hot posts first). Fallback: compute Wilson scores in SQL on cache miss (expensive, but correct).

Orphaned Comments (Data Corruption)

Bug hard-deletes a parent → children's parent_id points nowhere → invisible orphans. Fix: Never hard-delete (soft-delete only). FK constraints prevent accidental removal. Periodic integrity audit job detects orphans and path inconsistencies.

09

Interview Tips

💡
Open with the core tension.
"We need to store, retrieve, rank, and paginate a tree structure at massive read scale where every node has a rapidly changing vote score." This immediately shows you understand the problem isn't CRUD.
Derive, don't pattern-match.
Calculate the 200:1 read-to-write ratio before reaching for Redis. Show the interviewer that each component exists because a number demanded it, not because "everyone uses Redis."
🎯
Explain Wilson score intuitively.
"Instead of asking what percentage of votes are upvotes, we ask: given the votes we've seen, what's the WORST the true approval rate could be?" Then show how it penalizes small sample sizes — 2 upvotes / 0 downvotes ranks below 100 upvotes / 10 downvotes.
🔑
Tree pagination is the differentiator.
Most candidates hand-wave "fetch from DB and sort." Explain the three-dimensional pagination problem (breadth × depth × sort), the level-by-level window function approach, and the three cursor types. This is the deep-dive that separates a strong answer from an average one.
⚖️
Articulate what you gave up.
For every decision, state the cost: "Precomputed scores mean up to 60s staleness. Eventual consistency means two users see different numbers. Level-by-level queries add 3 extra round trips." Interviewers want to see you understand tradeoffs, not just benefits.
🏗️
Start with single-server baseline.
"A single PostgreSQL instance handles 175 writes/sec easily. It breaks at 35K reads/sec — that's what forces us to add caching." This shows maturity: you don't over-engineer, you scale in response to measured bottlenecks.
11

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Server

One PostgreSQL instance with adjacency list (parent_id only). Comments fetched with WITH RECURSIVE CTE, sorted in application code. Votes update the comment row directly. No cache. Handles ~1K reads/sec, sufficient for a small community.

2

Growth — Add Caching & Materialized Path

Add Redis for pre-assembled tree caching (10× read throughput). Add path column and switch to level-by-level queries. Precompute Wilson scores. Add read replicas. Move votes to Redis counters + async batch DB updates via a simple job queue. Handles ~10K reads/sec.

3

Scale — Full Distributed Architecture

Kafka for event-driven processing with independent consumer groups. Redis Cluster for partitioned caching. Hash-partitioned PostgreSQL by post_id. Thundering herd protection (distributed locks + stale backups). Auto-scaling workers, circuit breakers, load shedding. Handles ~35K+ reads/sec at Reddit scale.

4

Planet-Scale — Beyond Reddit

Multi-region deployment with regional PostgreSQL primaries and cross-region async replication. CDN edge caching with 10-15s TTL for viral posts. Separate read/write APIs (CQRS) if needed. ML-based comment ranking (beyond Wilson score). Real-time WebSocket push for live discussion threads. Sharded Kafka across regions.

Next up