(a) Hot Ranking Algorithm. Reddit's classic formula:
hot_score = log10(max(|ups - downs|, 1)) * sign(ups - downs)
+ (post_created_utc - epoch) / 45000
The key insight: the time component is absolute, not relative. A post from today starts with a higher baseline than one from yesterday. No need to re-score old posts -- they naturally sink as newer posts have higher time baselines. The log of votes means the first 10 upvotes matter as much as the next 100 -- this prevents runaway posts from dominating forever. The sign function means downvoted posts (controversial) rank below zero-vote posts.
This makes the algorithm O(1) per post at write time (recompute score on each vote) and O(0) maintenance -- no background recalculation, no scanning all posts. The score is stored in a Redis sorted set per subreddit. Feed reads are ZREVRANGE -- O(log N + K) where K is page size (25).
-- Compute hot score on vote
import math, time
def hot_score(ups, downs, created_utc):
score = ups - downs
order = math.log10(max(abs(score), 1))
sign = 1 if score > 0 else -1 if score < 0 else 0
seconds = created_utc - 1134028003 # Reddit epoch
return round(sign * order + seconds / 45000, 7)
(b) Comment Tree Storage — Materialized Path. Each comment stores its full ancestry path, e.g., /root/abc/def/self.
-- Fetch entire subtree of comment 'abc' in one query:
SELECT * FROM comments
WHERE post_id = :post_id
AND path LIKE '/root/abc/%'
ORDER BY path;
This avoids N+1 recursive queries of an adjacency list. The path column is indexed with a B-tree prefix index. Within each depth level, comments are sorted by Wilson score confidence interval (lower bound of 95% CI on upvote proportion):
-- Wilson score lower bound (simplified)
-- n = total votes, p = fraction of upvotes, z = 1.96 (95% CI)
wilson = (p + z*z/(2*n) - z * sqrt((p*(1-p) + z*z/(4*n))/n))
/ (1 + z*z/n)
This correctly handles the "1 upvote / 0 downvote vs 100 upvotes / 50 downvotes" ranking problem. A comment with few but all-positive votes ranks appropriately against a controversial comment with many total votes. The score is computed at vote time and stored alongside the comment for fast retrieval.
(c) Vote Counting Pipeline. 50K votes/sec cannot hit Postgres directly (hot-row contention). Three-layer approach:
- Dedup: Redis SET per post (
SADD post:{id}:voters {user_id}). If already a member, reject duplicate vote. For high-cardinality posts, use a Bloom filter instead.
- Fast counter: Redis
INCR post:{id}:score. Sub-ms. This is the counter used for hot ranking and display.
- Durability: Vote event published to Kafka. A consumer aggregates batches (every 5 s or 1000 votes) and flushes to Postgres. On Redis crash, Postgres is the recovery source.
Vote Pipeline SequenceMermaid
sequenceDiagram
participant U as User
participant VS as Vote Service
participant R as Redis
participant K as Kafka
participant A as Aggregator
participant PG as Postgres
participant FS as Feed Service
U->>VS: POST /posts/{id}/vote {dir:1}
VS->>R: SADD post:123:voters user42
R-->>VS: 1 (new vote)
VS->>R: INCR post:123:score
R-->>VS: 1547
VS->>K: VoteEvent{post:123, user:42, dir:1}
VS-->>U: {new_score: 1547}
K->>A: batch of vote events
A->>PG: INSERT votes + UPDATE post score
A->>FS: trigger feed re-rank for affected subreddit
Moderation Pipeline. Each subreddit has moderators with tools to remove posts, ban users, and configure AutoMod rules (regex-based filters). Moderation actions are synchronous -- a removed post disappears from the feed immediately (delete from Redis sorted set). AutoMod runs as a pre-save hook on Post Service: if a post matches any rule, it is flagged or auto-removed before entering the feed pipeline. Mod actions are audit-logged to a separate Postgres table for transparency (mod log is public on Reddit).
Home Feed Construction. A user subscribed to 50 subreddits needs a merged, ranked feed. Two strategies:
- Fan-out on write: when a post is created, push its ID to every subscriber's feed (Redis list). Fast reads, but a subreddit with 10M subscribers = 10M writes per post. Only viable for small subreddits.
- Fan-out on read: at read time, fetch the top N posts from each subscribed subreddit's sorted set, merge-sort by hot score, return top 25. Slower reads, but no write amplification. Used for large subreddits.
Reddit's hybrid: small subreddits (< 100K subscribers) fan-out on write; large subreddits fan-out on read. The Feed Service maintains a bitmap per subreddit to decide which strategy to use.
Search. Posts and comments are indexed into Elasticsearch via Kafka consumers. Index structure: one index per month (time-based rollover for efficient retention). Queries support subreddit scoping, author filtering, date ranges, and flair-based filtering. Autocomplete uses an edge-ngram tokenizer on post titles. Search ranking combines text relevance (BM25) with engagement signals (vote count, comment count) to surface popular results over obscure matches.
Karma Calculation. A user's karma = sum of net votes on all their posts and comments, with diminishing returns per post (first 10 upvotes count fully, next 100 count ~50%, etc.). Karma is computed async by a Kafka consumer that listens to vote events. The running total is cached in a Redis hash per user (HINCRBY user:42:karma post_karma 1). Displayed karma may lag actual karma by up to 60 seconds -- acceptable for a vanity metric. Reddit also separates post karma from comment karma, stored as two fields in the same hash.
Data Model Summary:
posts: {id, subreddit_id, author_id, title, body, type, score, hot_score, created_at}
comments: {id, post_id, author_id, body, path, wilson_score, ups, downs, created_at}
votes: {user_id, target_id, target_type, direction, created_at} -- PK: (user_id, target_id)
users: {id, username, post_karma, comment_karma, created_at}
Interview answer
"Posts are ranked by Reddit's Hot formula: log10(votes) + time_created/45000. Time is absolute so old posts decay without re-scoring. Comments use materialized path storage -- one indexed LIKE query fetches an entire subtree, sorted by Wilson score. Votes go through Redis for dedup (SET) and counting (INCR), then Kafka to a batch aggregator that flushes to Postgres. Redis is the fast path for reads; Postgres is the durable source of truth. Feed is a pre-computed Redis sorted set per subreddit, updated when vote aggregation completes. Home feed uses a hybrid fan-out: write-based for small subs, read-based merge-sort for large subs."