Reddit-Style Comments

01

Problem Statement

Design a comment system like Reddit's, where users can write top-level comments on posts, reply to any comment (creating arbitrarily deep nested threads), upvote or downvote any comment, and view the comment tree sorted by different ranking algorithms (best, top, new, controversial).

Unlike flat comment systems (YouTube, Instagram), Reddit comments form a rooted, ordered, variable-depth tree. Every node at every level can be independently voted on and ranked. The tree can grow to tens of thousands of comments on viral posts while needing to load in under 200ms.

Core question: How do you store, retrieve, rank, and paginate a tree structure at massive read scale — where every node has a rapidly changing vote score?

What's in scope: Nested comments, voting, ranking, tree pagination, "load more replies", "continue this thread", soft deletion.

What's out of scope: Post system, subreddits, moderation tools, awards/gilding, user profiles, notification delivery internals, real-time WebSocket push.

02

Requirements

Functional Requirements

Create comment — top-level on a post, or reply to any existing comment (tree grows deeper)
Vote — upvote, downvote, or remove vote on any comment. One vote per user per comment, changeable.
View comment tree — fetch a sorted, truncated, nested tree for a post. Multiple sort orders: best, top, new, controversial, old.
Load more replies — expand truncated sibling branches. "Load more replies (42)" at any tree node.
Continue thread — navigate into deeply nested chains beyond the initial depth cutoff.
Edit & delete — author can edit body text or soft-delete (shows "[deleted]", preserves tree structure).

Non-Functional Requirements

Read latency < 200ms — comment tree load must feel instant. This drives the entire caching strategy.
Eventual consistency on votes — vote counts can lag by up to 60 seconds. Reddit itself fuzzes displayed scores. This is a gift architecturally.
Read-your-own-writes for comments — if a user posts a reply and refreshes, they must see it immediately.
High availability — reading comments must work even during partial infrastructure failures.
Bounded response size — max ~200 comments per response (~80 KB payload), regardless of total tree size.

The requirement that drives everything: Fetch a sorted, truncated, nested tree in under 200ms for a post that might have 50,000 comments.

03

Scale Estimation

Every number is derived from stated assumptions. Adjust the inputs and watch the architecture change.

Metric	Value
Peak comment reads/sec	~35K/s
Peak comment writes/sec	~175/s
Peak vote writes/sec	~2,600/s
Read-to-write ratio	200:1

Derivation Chain

500M MAU → 50M DAU (10%) → 2% of DAU comment (1M users), ~3 comments each → 3M comments/day (~35/s avg, ~175/s peak at 5×).
9% of DAU vote (4.5M users), 10 votes each → 45M votes/day (~520/s avg, ~2,600/s peak at 5×).
80% of DAU view comments (40M users), 15 pages each → 600M page views/day (~7K/s avg, ~35K/s peak at 5×).
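
The chain above is plain arithmetic, so it is easy to re-run with different inputs. A minimal sketch, assuming the 5× peak factor applies to all three rates and taking the 3M comments/day figure as given (the variable names are illustrative):

```python
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 5  # peak-to-average multiplier used in the text

mau = 500_000_000
dau = mau * 10 // 100                    # 10% of MAU are active daily

comments_per_day = 3_000_000             # figure quoted in the text
votes_per_day = dau * 9 // 100 * 10      # 9% of DAU vote, 10 votes each
views_per_day = dau * 80 // 100 * 15     # 80% of DAU view, 15 pages each


def rates(per_day):
    """Return (average per second, peak per second)."""
    avg = per_day / SECONDS_PER_DAY
    return avg, avg * PEAK_FACTOR


comment_avg, comment_peak = rates(comments_per_day)
vote_avg, vote_peak = rates(votes_per_day)
view_avg, view_peak = rates(views_per_day)

# Comment reads per comment write, the ratio the design leans on.
read_write_ratio = views_per_day / comments_per_day

for name, avg, peak in [
    ("comment writes", comment_avg, comment_peak),
    ("vote writes", vote_avg, vote_peak),
    ("comment reads", view_avg, view_peak),
]:
    print(f"{name}: ~{avg:,.0f}/s avg, ~{peak:,.0f}/s peak")
print(f"read-to-write ratio: {read_write_ratio:.0f}:1")
```

Halving the DAU fraction or doubling pages per viewer changes the read peak directly, which is the number that sizes the cache tier.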

Storage

Metric	Value	Insight
Comment size (row)	~500 bytes	Text + metadata + indexes
Daily comment storage	1.5 GB/day	Not a sharding driver
Annual storage	~550 GB/year	Single DB can hold 5 years
Vote record size	~50 bytes	Grows faster than comments
Response payload	~80 KB	~200 comments × 400 bytes
Peak bandwidth	~2.8 GB/s	CDN + cache absorbs this

What the numbers tell us: The system is overwhelmingly read-heavy (200:1). Vote counting is the hottest write path, not comment creation. Storage is a non-issue — we shard for read throughput, not capacity.

04

API Design

Seven endpoints. The complexity isn't in the API surface — it's in the tree assembly logic behind the read endpoints.

Get Comment Tree (the hardest endpoint)

GET /api/v1/posts/{post_id}/comments
    ?sort=best          // best | top | new | controversial | old
    &depth=3            // max nesting depth to return
    &limit=20           // max top-level comments
    &child_limit=5      // max children per comment per level
    &cursor={token}     // opaque pagination token

→ 200 { data: { post_id, total_comments, sort, comments: [
    { id, user: {id, username, avatar_url}, body, depth,
      upvotes, downvotes, score, user_vote,
      child_count, has_more_children, children: [...] }
  ], pagination: { next_cursor, has_more } } }
      

Create Comment (top-level or reply)

POST /api/v1/posts/{post_id}/comments
Auth: Bearer {token}
Body: { parent_id: "uuid" | null, body: "text" }
→ 201 { data: { id, post_id, parent_id, user, body, depth, ... } }
      

Load More Replies

GET /api/v1/comments/{comment_id}/replies
    ?sort=best&limit=10&depth=3&cursor={token}
→ 200 { data: { parent_id, replies: [...], pagination: {...} } }
      

Continue Thread (deep chain)

GET /api/v1/comments/{comment_id}/thread?sort=best&limit=20
→ 200 { data: { ancestor_chain: [...], focus_comment: {...}, pagination } }
      

Vote on Comment

PUT /api/v1/comments/{comment_id}/vote
Auth: Bearer {token}
Body: { vote: 1 | -1 | 0 }    // upvote | downvote | remove
→ 200 { data: { comment_id, upvotes, downvotes, score, user_vote } }
      

Edit & Delete

PATCH /api/v1/comments/{id}   → body update, sets edited_at
DELETE /api/v1/comments/{id}  → soft delete: is_deleted=true, body="[deleted]"
      

Key Design Decisions

Cursor-based pagination

Offset pagination breaks on sorted, dynamic datasets. Cursors encode (wilson_score, id) for stable keyset pagination.
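
One way to make the cursor opaque is to serialize the (wilson_score, id) pair and base64-encode it. This is a sketch only: the JSON field names and the function names are assumptions, and a production token would likely also be signed or versioned.

```python
import base64
import json
from typing import Tuple


def encode_cursor(wilson_score: float, comment_id: str) -> str:
    """Pack the keyset position into a URL-safe opaque token."""
    payload = json.dumps({"s": wilson_score, "id": comment_id}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(token: str) -> Tuple[float, str]:
    """Recover (wilson_score, id) from a token produced by encode_cursor."""
    payload = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return float(payload["s"]), str(payload["id"])


token = encode_cursor(0.723, "c20")
print(token, decode_cursor(token))
```

Because the token carries the last row's sort key rather than an offset, new comments arriving above that position do not shift the next page.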

Server owns tree assembly

Client receives a pre-assembled, sorted, truncated nested JSON tree — not a flat list to reconstruct.

PUT for votes (idempotent)

One endpoint handles upvote, downvote, and removal. Upvoting twice = same effect. Simpler than separate endpoints.
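
Idempotency falls out naturally if the server computes counter deltas from the previous vote instead of blindly incrementing. A minimal in-memory sketch (VoteLedger and vote_delta are hypothetical names, standing in for the comment_votes table and the counter layer):

```python
def vote_delta(previous: int, new: int):
    """Return (upvote_delta, downvote_delta) for changing a vote from previous to new."""
    if previous not in (-1, 0, 1) or new not in (-1, 0, 1):
        raise ValueError("vote must be 1, -1, or 0")
    up = int(new == 1) - int(previous == 1)
    down = int(new == -1) - int(previous == -1)
    return up, down


class VoteLedger:
    """In-memory stand-in for per-user vote state plus per-comment counters."""

    def __init__(self):
        self.votes = {}   # (user_id, comment_id) -> 1 or -1
        self.counts = {}  # comment_id -> [upvotes, downvotes]

    def put_vote(self, user_id, comment_id, vote):
        previous = self.votes.get((user_id, comment_id), 0)
        up, down = vote_delta(previous, vote)
        counts = self.counts.setdefault(comment_id, [0, 0])
        counts[0] += up
        counts[1] += down
        if vote == 0:
            self.votes.pop((user_id, comment_id), None)  # removing a vote
        else:
            self.votes[(user_id, comment_id)] = vote
        return tuple(counts)


ledger = VoteLedger()
print(ledger.put_vote("u1", "c1", 1))   # first upvote
print(ledger.put_vote("u1", "c1", 1))   # repeat: counters unchanged
print(ledger.put_vote("u1", "c1", -1))  # flip to downvote
```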

Soft delete preserves tree

Hard-deleting a parent orphans all children. Soft delete keeps tree structure intact, showing "[deleted]" placeholder.

05

High-Level Architecture

Every component exists because a specific number demanded it. The 200:1 read-to-write ratio drives the cache-first design. The 2,600 votes/sec peak, concentrated on a few hot keys, drives the Redis counter layer. The 35,000 reads/sec peak demands pre-assembled tree caching.

Component Summary

Component	Tech	Why It Exists
CDN	CloudFront	Absorb 2.8 GB/s bandwidth, short-TTL cache for comment pages
Load Balancer	ALB (L7)	Distribute across 18 stateless API servers, health checks
API Servers ×18	Go	Handle tree assembly, caching, auth. Sized for 35K reads/s
Redis Cluster ×6	Redis 7	Pre-assembled tree cache, vote counters, rate limits, user vote state
PostgreSQL	PG 16	Durable storage. 1 primary (writes) + 3 replicas (reads)
PgBouncer	—	Multiplex 360 app connections over 100 DB connections
Kafka	3-broker	Async: score recalculation, cache invalidation, notifications
Workers	Go	Score calculator (batch votes), cache invalidator, notification sender

Request Flow — Step Through

Client→CDN→Load Balancer→API Server→Redis Cache→PostgreSQL→Assemble Tree→Return


06

Deep Dive — Tree Retrieval & Pagination

This is the hardest engineering problem in the system. Flat list pagination is one-dimensional. Tree pagination operates in three dimensions simultaneously: breadth (siblings per level), depth (how many levels), and sort order (different at every level). No off-the-shelf database feature handles this.

Why the Naive Approach Fails

Fetching all 50,000 comments for a viral post and assembling in memory: 25 MB from DB, 40 MB in memory, 20 MB JSON response — but we only show ~200 comments. That's 250× more data than needed.

Our Approach: Top-Down Level-by-Level Fetching

Fetch only the comments we'll render, level by level, using window functions to select the top-N children per parent at each depth.

sequenceDiagram
    participant C as Client
    participant A as API Server
    participant R as Redis Cache
    participant DB as PostgreSQL
    C->>A: GET /posts/{id}/comments?sort=best
    A->>R: GET tree:{post_id}:best:page1
    alt Cache HIT (90%)
        R-->>A: Compressed JSON tree
        A->>R: HGETALL user_votes:{user}:{post}
        R-->>A: Vote map
        A-->>C: Tree + user votes (3-5ms)
    else Cache MISS (10%)
        R-->>A: null
        A->>R: SET lock:tree:{post_id} NX EX 5
        A->>DB: Top 20 top-level (wilson_score DESC)
        DB-->>A: 20 rows
        A->>DB: Top 5 children per parent (window fn)
        DB-->>A: ≤100 rows
        A->>DB: Top 3 grandchildren per parent
        DB-->>A: ≤300 rows
        A->>DB: Top 2 great-grandchildren per parent
        DB-->>A: ≤600 rows
        A->>A: Assemble tree (~1ms)
        A->>R: SETEX tree (45s TTL)
        A-->>C: Tree response (30-85ms)
    end

The Window Function Query

Each level uses a single batch query with ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY wilson_score DESC) to select the top-N children per parent — avoiding N separate queries.

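-- Top 5 children for each of the 20 parents, in one round trip.
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY parent_id
            ORDER BY wilson_score DESC, id DESC  -- id breaks ties, matching the (wilson_score, id) cursor
        ) AS rn                                  -- "rn" avoids confusion with the rank() window function
    FROM comments
    WHERE parent_id IN ('c1', 'c2', ... 'c20')
) ranked
WHERE rn <= 5;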
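
The query can be tried end to end without a PostgreSQL instance. The sketch below uses Python's built-in sqlite3 as a stand-in (an assumption: SQLite 3.25 or newer, which supports window functions), with a tiny made-up data set and a hypothetical top_children helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id TEXT PRIMARY KEY, parent_id TEXT, wilson_score REAL)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        ("a1", "c1", 0.90), ("a2", "c1", 0.40), ("a3", "c1", 0.70),
        ("b1", "c2", 0.20), ("b2", "c2", 0.80),
    ],
)


def top_children(conn, parent_ids, n):
    """Fetch the n best children of every parent in a single query."""
    placeholders = ",".join("?" for _ in parent_ids)
    sql = f"""
        SELECT id, parent_id, wilson_score FROM (
            SELECT id, parent_id, wilson_score,
                   ROW_NUMBER() OVER (
                       PARTITION BY parent_id
                       ORDER BY wilson_score DESC, id DESC
                   ) AS rn
            FROM comments
            WHERE parent_id IN ({placeholders})
        ) AS ranked
        WHERE rn <= ?
        ORDER BY parent_id, rn
    """
    return conn.execute(sql, [*parent_ids, n]).fetchall()


result = top_children(conn, ["c1", "c2"], 2)
for row in result:
    print(row)
```

Only the placeholders are interpolated into the SQL string; the ids themselves are bound as parameters.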
      

Fetch Budget Per Level

Level	Parents	Children/Parent	Max Rows	Queries
0 (top-level)	—	20	20	1
1	20	5	100	1
2	≤100	3	300	1
3	≤300	2	600	1
Total			≤1,020	4

We fetch ≤1,020 comments in 4 queries vs. 50,000 in one — a 50× data reduction at the cost of 3 extra round trips.
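
Both the budget and the in-memory assembly step are small enough to sketch. This assumes rows arrive as dicts, concatenated level by level so every parent precedes its children, and already sorted within each parent; fetch_budget and assemble_tree are illustrative names.

```python
def fetch_budget(top_level, children_per_level):
    """Worst-case row count: each level multiplies the previous level's parents."""
    parents = total = top_level
    for per_parent in children_per_level:
        parents *= per_parent
        total += parents
    return total


def assemble_tree(rows):
    """Nest flat rows into a tree in one pass, preserving the fetched order."""
    nodes = {}
    roots = []
    for row in rows:
        node = dict(row, children=[])
        nodes[node["id"]] = node
        parent = nodes.get(node["parent_id"])
        if parent is None:
            roots.append(node)       # top-level, or parent not part of this fetch
        else:
            parent["children"].append(node)
    return roots


rows = [
    {"id": "c1", "parent_id": None},
    {"id": "c2", "parent_id": None},
    {"id": "r1", "parent_id": "c1"},
    {"id": "r2", "parent_id": "c1"},
    {"id": "g1", "parent_id": "r1"},
]
tree = assemble_tree(rows)
print(fetch_budget(20, [5, 3, 2]))
print([node["id"] for node in tree])
```

Assembly is a single pass over at most about a thousand rows, which is consistent with the ~1ms the sequence diagram allots to it.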

Three Types of Cursors

Top-Level Pagination

Keyset cursor: (wilson_score, id) < (0.723, 'c20'). Loads the next 20 root comments with their subtrees.

Load More Siblings

Scoped to one parent_id. Same keyset pattern, fetching the next batch of children for a specific comment.

Continue Thread

Fresh subtree rooted at the target comment. Includes ancestor chain for breadcrumb context (free via materialized path).
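
The keyset comparison behind the first two cursor types can be modelled in a few lines. In this sketch Python tuples play the role of SQL row values and an in-memory list stands in for the indexed table; keyset_page is a hypothetical helper. For "load more siblings", the same function would simply be fed one parent's children.

```python
def keyset_page(comments, limit, cursor=None):
    """Return (page, next_cursor) for (wilson_score, id) tuples, best first."""
    ordered = sorted(comments, reverse=True)            # ORDER BY wilson_score DESC, id DESC
    if cursor is not None:
        ordered = [c for c in ordered if c < cursor]    # (wilson_score, id) < cursor
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = page[-1] if has_more else None
    return page, next_cursor


data = [(0.9, "c1"), (0.8, "c2"), (0.8, "c3"), (0.5, "c4"), (0.1, "c5")]
page1, cur1 = keyset_page(data, 2)
# A new high-scoring comment arrives between requests...
data.append((0.95, "c9"))
# ...but page 2 still starts right after the last row the client saw.
page2, cur2 = keyset_page(data, 2, cur1)
print(page1, page2)
```

With offset pagination the new comment would have pushed (0.8, "c3") onto page 2 and the client would have seen it twice.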

Cache Strategy

Cache the generic tree without user_vote. Overlay each user's votes per-request from a separate Redis hash. One cache serves all users.
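
The overlay step might look like the sketch below. It assumes the cached tree is a list of nested dicts and that the vote map comes back as comment id to string value, as a Redis hash read would return it; overlay_user_votes is an illustrative name.

```python
import copy


def overlay_user_votes(cached_tree, vote_map):
    """Return a per-user copy of the shared tree with user_vote filled in."""
    tree = copy.deepcopy(cached_tree)   # never mutate the shared cached object
    stack = list(tree)
    while stack:
        node = stack.pop()
        node["user_vote"] = int(vote_map.get(node["id"], 0))  # 0 = no vote
        stack.extend(node.get("children", []))
    return tree


cached = [{"id": "c1", "children": [{"id": "r1", "children": []}]}]
personal = overlay_user_votes(cached, {"r1": "-1"})
print(personal)
```

Anonymous readers skip the overlay entirely and get the cached tree as is.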

Data Model: Adjacency List + Materialized Path

We store parent_id for write simplicity (O(1) INSERT) and a path column for read power (prefix-based subtree queries, ancestor lookups without recursion). Best of both worlds — since comments are never re-parented, the path's biggest weakness doesn't apply.
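
What the path column buys can be shown with plain string operations. A sketch assuming the dot-separated format from the schema comment and ids that never contain a dot (true for UUIDs); the helper names are illustrative.

```python
def child_path(parent_path, comment_id):
    """Path for a new comment: the parent's path plus its own id."""
    return comment_id if parent_path is None else f"{parent_path}.{comment_id}"


def ancestors(path):
    """Ancestor ids, root first, without touching the database."""
    return path.split(".")[:-1]


def depth(path):
    """Top-level comments have depth 0."""
    return path.count(".")


def in_subtree(path, root_path):
    """Prefix test, the in-memory equivalent of path LIKE 'root.%'."""
    return path == root_path or path.startswith(root_path + ".")


p = child_path(child_path(child_path(None, "A"), "B"), "C")
print(p, ancestors(p), depth(p), in_subtree(p, "A.B"))
```

The trailing dot in the prefix test matters: without it, "A.BC" would wrongly count as a descendant of "A.B".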

CREATE TABLE comments (
    id              UUID PRIMARY KEY,
    post_id         UUID NOT NULL,
    parent_id       UUID REFERENCES comments(id),
    path            TEXT NOT NULL,        -- "uuid-A.uuid-B.uuid-C"
    user_id         UUID,
    body            TEXT NOT NULL,
    depth           INT NOT NULL DEFAULT 0,
    upvotes         INT NOT NULL DEFAULT 0,
    downvotes       INT NOT NULL DEFAULT 0,
    score           INT NOT NULL DEFAULT 0,
    wilson_score    FLOAT NOT NULL DEFAULT 0,
    controversial_score FLOAT NOT NULL DEFAULT 0,
    direct_child_count  INT NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    edited_at       TIMESTAMPTZ,          -- set by PATCH; NULL until first edit
    is_deleted      BOOLEAN NOT NULL DEFAULT false
);

-- Key indexes
CREATE INDEX idx_post_path ON comments (post_id, path);
CREATE INDEX idx_parent_wilson ON comments (parent_id, wilson_score DESC);
CREATE INDEX idx_post_wilson ON comments (post_id, wilson_score DESC);

CREATE TABLE comment_votes (
    user_id     UUID NOT NULL,
    comment_id  UUID NOT NULL REFERENCES comments(id),
    vote_type   SMALLINT NOT NULL,  -- 1, -1
    PRIMARY KEY (user_id, comment_id)
);
      

Why not NoSQL? A document store (MongoDB) hits the 16 MB doc limit on viral posts. Updating one comment's score requires read-modify-write on a multi-MB document. PostgreSQL's window functions, CTEs, and row-value comparisons are purpose-built for our tree queries.

07

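Key Design Decisions & Tradeoffs

Async Processing: Synchronous Processing (the alternative to the Kafka Event Queue)

Simpler: no Kafka, no workers, no consumer lag monitoring. But it adds ~100ms to every write and couples the comment API to every downstream system. Fine for an MVP, painful at scale.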

08

What Can Go Wrong

Cache Stampede on Viral Posts

Cache TTL expires on a post with 2M viewers → 35K simultaneous cache-miss rebuilds hit the database. Fix: Distributed lock (only one server rebuilds), stale-while-revalidate backup (serve slightly old data while rebuilding), adaptive TTL (shorter for hot posts, longer for cold).
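
The lock plus stale-copy combination can be sketched against an in-memory cache. FakeCache is a hypothetical stand-in that mimics SET NX EX semantics, time is passed in explicitly so the behaviour is deterministic, and the stale key name and its 300s TTL are assumptions.

```python
class FakeCache:
    """In-memory stand-in for a cache with TTLs and set-if-absent."""

    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.data.get(key)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def set(self, key, value, ttl, now):
        self.data[key] = (value, now + ttl)

    def set_nx(self, key, value, ttl, now):
        if self.get(key, now) is not None:
            return False
        self.set(key, value, ttl, now)
        return True


def get_tree(cache, post_id, rebuild, now, fresh_ttl=45, stale_ttl=300, lock_ttl=5):
    fresh = cache.get(f"tree:{post_id}", now)
    if fresh is not None:
        return fresh, "hit"
    # Only the caller that wins the lock rebuilds from the database.
    if cache.set_nx(f"lock:tree:{post_id}", "1", lock_ttl, now):
        tree = rebuild(post_id)
        cache.set(f"tree:{post_id}", tree, fresh_ttl, now)
        cache.set(f"stale:tree:{post_id}", tree, stale_ttl, now)
        return tree, "rebuilt"
    # Everyone else serves the longer-lived stale copy, if there is one.
    stale = cache.get(f"stale:tree:{post_id}", now)
    if stale is not None:
        return stale, "stale"
    return None, "busy"  # caller may wait briefly and retry


calls = []


def rebuild(post_id):
    calls.append(post_id)
    return {"post": post_id, "version": len(calls)}


cache = FakeCache()
print(get_tree(cache, "p1", rebuild, now=0))
print(get_tree(cache, "p1", rebuild, now=1))
```

The lock here simply expires after its TTL; a real implementation would probably release it as soon as the rebuild finishes.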

Hot-Key Vote Contention

Top comment on a viral post: 500 votes/sec on one DB row → row-lock serialization, cascading timeouts. Fix: Redis HINCRBY for instant counting (100K ops/sec, no locks), Kafka + batch worker collapses 500 writes/sec into 1 batched UPDATE/sec.
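
The batching step is a simple fold. A sketch, assuming each vote event carries a comment id plus upvote and downvote deltas (this event shape is an assumption, not something the design specifies); collapse_votes is an illustrative name.

```python
from collections import defaultdict


def collapse_votes(events):
    """Fold a window of vote events into one net (up, down) delta per comment."""
    totals = defaultdict(lambda: [0, 0])
    for comment_id, up, down in events:
        totals[comment_id][0] += up
        totals[comment_id][1] += down
    # Comments whose votes cancelled out need no UPDATE at all.
    return {cid: (up, down) for cid, (up, down) in totals.items() if up or down}


# One second of traffic on a hot comment, plus a vote and its retraction on another.
events = [("c1", 1, 0)] * 300 + [("c1", 0, 1)] * 200 + [("c2", 1, 0), ("c2", -1, 0)]
batch = collapse_votes(events)
print(batch)
```

Each entry in the result maps to a single UPDATE that adds the deltas to upvotes and downvotes, so 502 events become one row write.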

Database Primary Failure

Primary crashes → all writes fail. Reads continue from cache + replicas. Fix: Automated failover with Patroni/pg_auto_failover (30-60s). Kafka retains unprocessed vote events until the new primary is ready, so no acknowledged vote is lost as long as producers publish with acks=all.

Cascading Timeout Collapse

DB goes from 10ms to 200ms queries → API timeouts → client retries → 3× load → total collapse. Fix: Explicit timeouts at every boundary, circuit breaker on DB connections (stop sending requests during recovery), exponential backoff with jitter on retries, load shedding (reject excess requests with 503).
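
Of the four fixes, backoff with jitter is the easiest to get subtly wrong. A minimal sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap; the base and cap values are illustrative assumptions.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random):
    """Delay before retry i is uniform in [0, min(cap, base * 2**i)] seconds."""
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]


delays = backoff_delays(8, rng=random.Random(42))  # seeded only for repeatability
for i, d in enumerate(delays):
    print(f"retry {i}: wait up to {min(5.0, 0.1 * 2 ** i):.1f}s, drew {d:.3f}s")
```

The randomness is the point: without it, every client that timed out together retries together, recreating the spike.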

Cache-Database Inconsistency

Cache invalidation worker lags → user posts comment, refreshes, doesn't see it (stale cache served). Fix: Read-your-own-writes bypass — after writing, set a short Redis flag; next read for that user bypasses cache and queries DB primary directly. TTL (30-60s) self-heals regardless.

Comment Bombs & Deep Chains

Malicious user creates 500-deep chain or 10K spam comments on one post. Fix: Depth limit (50 levels enforced at API), three-layer rate limiting (per-user, per-user-per-post, per-post global), path length cap (2000 chars).
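
The structural limits can be enforced in one place on the write path. A sketch, assuming "50 levels" means a maximum depth value of 50 and that the path uses the dot-separated format from the schema; validate_reply is a hypothetical helper and rate limiting is left out.

```python
MAX_DEPTH = 50          # depth limit from the text
MAX_PATH_CHARS = 2000   # path length cap from the text


def validate_reply(parent_depth, parent_path, new_id):
    """Return (depth, path) for a reply, or raise if it would break a limit."""
    depth = parent_depth + 1
    path = f"{parent_path}.{new_id}"
    if depth > MAX_DEPTH:
        raise ValueError("thread too deep")
    if len(path) > MAX_PATH_CHARS:
        raise ValueError("path too long")
    return depth, path


print(validate_reply(0, "A", "B"))
```

With 36-character UUIDs a depth-50 path is under 1,900 characters, so the depth limit binds first and the path cap acts as a backstop.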

Kafka Consumer Lag

Score calculator can't keep up → Wilson scores become stale → sort order drifts visibly wrong. Fix: Auto-scaling workers based on consumer lag metrics. Priority processing (hot posts first). Fallback: compute Wilson scores in SQL on cache miss (expensive, but correct).
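
Whether it runs in a worker or as the on-demand fallback, the score itself is a short formula: the lower bound of the Wilson confidence interval on the upvote fraction. A sketch assuming z = 1.96 (a 95% interval); the confidence level a real deployment uses may differ.

```python
import math


def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote fraction."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    z2 = z * z
    centre = p + z2 / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)


small = wilson_lower_bound(2, 0)      # perfect ratio, tiny sample
large = wilson_lower_bound(100, 10)   # ~91% ratio, large sample
print(f"{small:.3f} {large:.3f}")
```

The tiny sample scores well below the large one despite its perfect ratio, which is the small-sample penalty that makes "best" sorting resistant to early lucky votes.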

Orphaned Comments (Data Corruption)

Bug hard-deletes a parent → children's parent_id points nowhere → invisible orphans. Fix: Never hard-delete (soft-delete only). FK constraints prevent accidental removal. Periodic integrity audit job detects orphans and path inconsistencies.

09

Interview Tips

💡

Open with the core tension.
"We need to store, retrieve, rank, and paginate a tree structure at massive read scale where every node has a rapidly changing vote score." This immediately shows you understand the problem isn't CRUD.

⚡

Derive, don't pattern-match.
Calculate the 200:1 read-to-write ratio before reaching for Redis. Show the interviewer that each component exists because a number demanded it, not because "everyone uses Redis."

🎯

Explain Wilson score intuitively.
"Instead of asking what percentage of votes are upvotes, we ask: given the votes we've seen, what's the WORST the true approval rate could be?" Then show how it penalizes small sample sizes — 2 upvotes / 0 downvotes ranks below 100 upvotes / 10 downvotes.

🔑

Tree pagination is the differentiator.
Most candidates hand-wave "fetch from DB and sort." Explain the three-dimensional pagination problem (breadth × depth × sort), the level-by-level window function approach, and the three cursor types. This is the deep-dive that separates a strong answer from an average one.

⚖️

Articulate what you gave up.
For every decision, state the cost: "Precomputed scores mean up to 60s staleness. Eventual consistency means two users see different numbers. Level-by-level queries add 3 extra round trips." Interviewers want to see you understand tradeoffs, not just benefits.

🏗️

Start with single-server baseline.
"A single PostgreSQL instance handles 175 writes/sec easily. It breaks at 35K reads/sec — that's what forces us to add caching." This shows maturity: you don't over-engineer, you scale in response to measured bottlenecks.

10

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Server

One PostgreSQL instance with an adjacency list (parent_id only). Comments are fetched with a WITH RECURSIVE CTE and sorted in application code. Votes update the comment row directly. No cache. Handles ~1K reads/sec, sufficient for a small community.

2

Growth — Add Caching & Materialized Path

Add Redis for pre-assembled tree caching (10× read throughput). Add path column and switch to level-by-level queries. Precompute Wilson scores. Add read replicas. Move votes to Redis counters + async batch DB updates via a simple job queue. Handles ~10K reads/sec.

3

Scale — Full Distributed Architecture

Kafka for event-driven processing with independent consumer groups. Redis Cluster for partitioned caching. Hash-partitioned PostgreSQL by post_id. Thundering herd protection (distributed locks + stale backups). Auto-scaling workers, circuit breakers, load shedding. Handles ~35K+ reads/sec at Reddit scale.

4

Planet-Scale — Beyond Reddit

Multi-region deployment with regional PostgreSQL primaries and cross-region async replication. CDN edge caching with 10-15s TTL for viral posts. Separate read/write APIs (CQRS) if needed. ML-based comment ranking (beyond Wilson score). Real-time WebSocket push for live discussion threads. Sharded Kafka across regions.

