System Design — 016

Top-K Leaderboard

Design a real-time ranking system that maintains sorted standings of millions of users, serving both "who's on top?" and "what's my rank?" with sub-50ms latency.

Redis ZSETSkip ListReal-Time RankingScore PackingCQRS
01

Problem Statement

Design a system that maintains and serves real-time leaderboards — think gaming leaderboards (Fortnite, PUBG), competitive platforms (LeetCode, Kaggle), or engagement rankings (Spotify Wrapped). Users perform actions that update their scores, and the system must efficiently answer "who are the top K?" and "what's my rank?" at any point.

The naive approach — SELECT * ORDER BY score DESC — works for 1,000 users. At 100 million users, re-sorting on every update is a non-starter. This is a data structure problem disguised as a system design problem. The choice of data structure determines your latency, memory footprint, and sharding strategy.

Core question: How do you maintain a globally sorted ranking of millions of entities that updates in real-time, while serving both top-K queries and individual rank lookups with low latency?

The Two Queries That Drive the Design

Top-K: "Give me the top 100 players." A range query on the head of a sorted set — relatively easy. Most sorted data structures answer this efficiently.

My Rank: "What rank am I?" This is the hard one. It asks "how many entities have a score higher than mine?" — a full table scan in a flat database. This single query is what separates a toy solution from a real one, and it pushes the entire design toward a skip list (Redis ZSET).

02

Requirements

Functional Requirements

  • Update a user's score — supports three modes: increment (add to current), replace (overwrite), and max (keep highest). Increment is the hardest — needs atomic read-modify-write.
  • Get Top-K users — return the K highest-scoring users with scores, real-time, with timestamp-based tiebreaking (earliest = higher rank for same score).
  • Get a user's rank — O(log N) rank lookup with optional surrounding window (show 5 users above and below).
  • Scoped leaderboards — support global, daily, weekly, monthly, regional, and tournament-specific leaderboards as independent sorted sets.
  • Profile enrichment — leaderboard engine stores (user_id, score) only. API layer enriches with display names, avatars, tier badges from a separate Profile Service.

Non-Functional Requirements

  • Low read latencyp99 < 50ms for top-K and rank queries. Data structure must live in memory.
  • High write throughput — tens of thousands of score updates/sec. Increment must be atomic (Redis ZINCRBY).
  • Eventual consistency on ranks — target lag < 5 seconds. Strong consistency on score accuracy — no lost updates.
  • High availability (99.9%+) — Redis Sentinel for failover. PostgreSQL as durable backing store for rebuild.
  • Scalability to 100M+ users — single ZSET fits on one shard (~11 GB). Redis Cluster for total memory capacity.
  • Idempotent writes — every score update carries a unique event ID. Duplicates are detected and skipped.

Key tension: Low read latency (in-memory) conflicts with score durability (disk-based). Resolution: Redis as the fast path, PostgreSQL as the safety net. Both are written on every update.

03

Scale Estimation

Anchored on a competitive gaming platform at Fortnite/PUBG scale. Every number is derived from the previous one.

100M
Registered Users
20M
DAU (20%)
5,600/s
Peak Writes
8,400/s
Peak Reads
1.5 : 1
Read:Write Ratio
~60 GB
Total Redis Memory
~1,100
Active Leaderboards
4 shards
Redis Cluster Size

Write Path Derivation

20M DAU × 6 matches/day ÷ 86,400 sec = 1,400 writes/sec average. Peak (4×) = 5,600/sec. Event spikes (10×) = ~14,000/sec. A single Redis instance handles 150K+ ops/sec — throughput is not our bottleneck.

Memory Derivation

Each ZSET entry ≈ 110 bytes (skip list node + hash table entry + 16-byte UUID member). Global leaderboard: 100M × 110B = 11 GB. All scopes combined: ~60 GB. Doesn't fit on one node → Redis Cluster with 4 shards × 32 GB = 128 GB total, using ~47%.

Critical Insight

The read:write ratio is nearly balanced (1.5:1) — unusual for most systems. This means write optimization matters as much as read. We can't just "cache everything and forget about writes." However, top-K is highly cacheable (same for all users), while rank lookups must hit Redis directly.

04

API Design

Update Score
POST /v1/leaderboards/{leaderboard_id}/scores
Headers: Idempotency-Key: <unique_event_id>      // REQUIRED

{
  "user_id":  "usr_a1b2c3d4",
  "score":    150,
  "mode":     "increment"     // "increment" | "replace" | "max"
}

→ 200 { "new_score": 2450, "previous_score": 2300,
        "idempotent_replay": false }

mode maps to Redis commands: increment → ZINCRBY, replace → ZADD, max → ZADD GT. The response does not include rank — computing rank is a separate O(log N) operation that would slow the write path.

Get Top-K
GET /v1/leaderboards/{leaderboard_id}/top?limit=100&cursor=null

→ 200 {
  "entries": [
    { "rank": 1, "user_id": "usr_x9y8z7", "score": 98750,
      "user_profile": { "display_name": "ShadowKnight", ... } },
    ...
  ],
  "next_cursor": "eyJvZmZzZXQiOjEwMH0=",
  "total_users": 43829104
}

Uses cursor-based pagination, not offset. Offset on a live sorted set causes skips/duplicates between pages. The cursor encodes the last score+user_id seen. Highly cacheable — CDN with 5-second TTL absorbs 99.99% of top-K traffic.

Get User Rank
GET /v1/leaderboards/{leaderboard_id}/users/{user_id}/rank?window=5

→ 200 {
  "rank": 48231, "score": 2450, "percentile": 99.89,
  "window": {
    "above": [ { "rank": 48229, "score": 2455 }, ... ],
    "below": [ { "rank": 48232, "score": 2448 }, ... ]
  }
}

Internally: ZREVRANK (O(log N)) + ZREVRANGE for the window. Percentile is derived: (total - rank) / total × 100. Not cacheable — per-user, must hit Redis directly.

Batch Score Update (Server-to-Server)
POST /v1/leaderboards/{leaderboard_id}/scores/batch
{ "updates": [
    { "user_id": "usr_a1b2", "score": 150, "mode": "increment" },
    { "user_id": "usr_e5f6", "score": 200, "mode": "increment" },
    ...   // up to 1000 per batch
] }

→ 200 { "processed": 100, "results": [...] }

A single match produces 100 score updates. Redis pipeline sends all ZINCRBY commands in one round trip — sub-millisecond on Redis side.

EndpointRedis OpLatencyCacheable
Update ScoreZINCRBY / ZADD< 5msNo
Get Top-KZREVRANGE< 10msYes (5s TTL)
Get User RankZREVRANK + ZREVRANGE< 10msNo
Batch UpdatePipeline ZINCRBY< 20msNo
Remove UserZREM< 5msNo
05

High-Level Architecture

Built on CQRS-lite: separate Read and Write services, Redis as the fast ranking engine, PostgreSQL as the durable source of truth. Each leaderboard is one ZSET on one shard — no cross-shard queries.

Clients Game Server / Web / Mobile CDN Cache Top-K · 5s TTL API Gateway Auth · Rate Limit Read Service Ranking Queries Write Service Score Ingestion Profile Service Name · Avatar · Tier Redis Cluster 4 Shards · ZSET per LB · ~60 GB Kafka score.updated events PostgreSQL Source of Truth · Events + Scores Consumer Workers Fan-out · Notifications Reconciliation Drift fix · Every 15 min reads writes miss ZREVRANK enrich ZINCRBY INSERT event fan-out compare

Key Architectural Principles

1. Redis is the fast path, PostgreSQL is the safety net. If Redis dies, we rebuild from PostgreSQL in minutes. If PostgreSQL dies, rankings keep working from Redis.

2. One leaderboard = one ZSET = one shard. No cross-shard queries. Every ranking operation is a single-node operation.

3. Cache the head, query the tail. Top-K is identical for everyone — cache aggressively. Rank lookups are per-user — serve live from Redis.

4. Async fan-out for secondary leaderboards. Primary leaderboard updated synchronously. Daily/weekly/regional updated via Kafka consumers.

Request Flow — Step Through
Game ClientAPI GatewayWrite ServiceIdempotency CheckPostgreSQLRedis ZSETKafkaConsumer Workers
Click Next Step to walk through the request flow.
06

Deep Dive — Real-Time Ranking at Scale

The entire leaderboard design rests on one data structure: the skip list inside Redis ZSET. Let's understand exactly why it's perfect for ranking, how it achieves O(log N) on all operations, and what happens when it's not enough.

Why Ranking Is a Data Structure Problem

You have 100M scores. You need two queries: top-K and "what's my rank?" A flat array needs O(n log n) sort for top-K and O(n) scan for rank. A B-Tree (database index) can do top-K efficiently but rank still requires a range scan — the B-Tree doesn't store subtree sizes. The answer: a data structure that maintains sorted order incrementally with position tracking.

The Skip List — How Redis ZSET Works

A skip list is a layered linked list. Level 0 is a sorted linked list of all elements. Higher levels are "express lanes" that skip over multiple elements. Think of it like a subway — local train (level 0), express (level 1), super-express (level 2). Search starts at the highest level and drops down.

sequenceDiagram participant C as Game Client participant GW as API Gateway participant WS as Write Service participant R as Redis ZSET participant PG as PostgreSQL participant K as Kafka participant CW as Consumer Worker C->>GW: POST /scores (score=150) GW->>WS: Route (authenticated) WS->>WS: Check idempotency key WS->>PG: INSERT score_event PG-->>WS: OK (durable) WS->>R: ZINCRBY lb:global 150 user_a R-->>WS: new_score = 2450 WS->>K: Publish score.updated WS-->>C: 200 { new_score: 2450 } K->>CW: Consume event CW->>R: ZINCRBY lb:weekly 150 user_a CW->>R: ZINCRBY lb:daily 150 user_a

The Key Augmentation — Span Tracking

Each forward pointer in the skip list stores a span — the number of elements it skips over. To find rank, traverse from HEAD at the highest level, accumulating spans as you go. For 100M elements with 32 levels, that's ~27 hops. Sub-microsecond on modern hardware.

ZSET = Skip List + Hash Table

The skip list gives O(log N) sorted operations. The hash table maps member → (score, node pointer) for O(1) score lookup. Together: ZADD O(log N), ZSCORE O(1), ZREVRANK O(log N), ZREVRANGE O(log N + K).

The Tiebreaker Trick — Score Packing

Redis sorts by one dimension. For tiebreaking, pack score + inverted timestamp into one float64:

composite = score × 10⁷ + (MAX_TIMESTAMP − actual_timestamp)

// Alice scores 2450 at t=1712345600 → 24502390099200
// Bob   scores 2450 at t=1712345700 → 24502390099100
// Alice's composite is higher → Alice ranks above Bob ✓

// Extract raw score: floor(composite / 10⁷) = 2450

IEEE 754 doubles have 52 bits of mantissa — integer precision up to 9 × 10¹⁵. Our composite uses at most 15 digits. Safely within range.

When One ZSET Isn't Enough — Sharding Strategies

A. Score-Range Sharding

Shard by score range. Rank = ZCARD of higher shards + rank within own shard. Fast rank queries. Uneven distribution, cross-shard migration on score change.

B. Hash-Based Sharding

Hash users across shards. Top-K = query all shards, merge. Rank = ZCOUNT on all shards. Even distribution. Every rank query fans out to all shards.

C. Tiered Approach (Production Answer)

Tier 1: ZSET of top 1M users (~110 MB). Exact ranking, fast, unsharded.
Tier 2: Score histogram (Fenwick tree) for all users. "How many users have score > X?" as a prefix sum in O(log B).
Result: exact rank for top players, approximate percentile for the rest. This is how Xbox Live and Steam work at scale.

The Decision Tree

UsersRanking EnginePrecision
< 1MPostgreSQL with B-Tree indexExact (COUNT query fast enough)
1M – 100MSingle Redis ZSETExact for all users
100M – 1BTiered: ZSET + histogramExact top-1M, percentile rest
1B+Streaming top-K (Space-Saving)Exact top-1K, approximate rest
07

Key Design Decisions & Tradeoffs

Ranking Engine

✓ Chosen

Redis ZSET (Skip List)

O(log N) rank + update. In-memory, sub-millisecond. All operations map to native Redis commands. No custom code needed.

✗ Alternative

PostgreSQL B-Tree

COUNT(*) range scan for rank — O(N) at 100M rows ≈ 200ms. Fine under 1M users. Beyond that, exceeds latency budget.

Write Path

✓ Chosen

Synchronous Dual-Write

PostgreSQL first, then Redis. Response returns after both succeed. User immediately sees updated rank. Simple, predictable.

✗ Alternative

Event-Driven (Kafka Buffer)

Write to Kafka, consume into both stores. Better burst handling. But 50-200ms lag before rank updates — user sees stale rank.

Leaderboard Scoping

✓ Chosen

Independent ZSET per Scope

Each time scope (daily, weekly) is a separate ZSET. O(log N) reads per scope. Memory cost: ~60 GB total for score duplication.

✗ Alternative

Computed Scoped Rankings

Store once, filter at query time. No duplication, but O(N) filter + sort per scoped query. Impractical at 100M entries.

Rank Precision

✓ Chosen

Exact Rank for All Users

100M-entry ZSET fits on one shard (11 GB). ZREVRANK is O(log N) regardless of rank position. Better UX — exact numbers motivate.

✗ Alternative

Tiered (Exact Top + Percentile)

Needed at 500M+ users when ZSET exceeds shard capacity. Our API already returns percentile — backward-compatible migration path.

Friends Leaderboard

✓ Chosen

Computed On Demand

Fetch friend list, pipeline ZSCORE for each friend (~200 O(1) calls), sort in-memory. Total: ~5ms. No write amplification.

✗ Alternative

Pre-Computed Per-User ZSET

Each score update fans out to 200 friends' ZSETs = 200× write amplification. At 5,600 writes/sec → 1.12M ZSET writes/sec. Absurd.

Tiebreaking

✓ Chosen

Composite Score Packing

Pack score + inverted timestamp into float64. Redis sorts natively. No extra queries. Tiebreaker is invisible to API consumer.

✗ Alternative

Application-Level Secondary Sort

Store raw score. Fetch tied users' timestamps from PostgreSQL at query time. Adds latency, complexity, and a DB dependency on the read path.

08

What Can Go Wrong

Redis Master Crashes

Downtime: 5-10 seconds. Redis Sentinel promotes the replica. Async replication means 0-100ms of writes may be lost — but PostgreSQL has them. Reconciliation worker fixes drift within 15 minutes. Mitigation: use WAIT 1 100 for semi-sync replication to eliminate the data loss window.

Score Drift Between Redis and PostgreSQL

Caused by partial write failures, replication lag during failover, or consumer bugs. Reconciliation worker samples 10,000 users every 15 minutes, overwrites Redis with PostgreSQL values on mismatch. For high-stakes queries (user checking own rank), the Read Service can do an inline validation — compare Redis score with PostgreSQL, trigger immediate correction if they differ (+5ms).

Hot Partition — Viral Leaderboard Spike

A streamer with 5M followers says "check the leaderboard!" — 500K rank queries in 10 seconds. CDN absorbs top-K reads. Redis handles 50K ZREVRANK ops/sec (within capacity). Read Service auto-scales. Profile Service circuit breaker opens — serve ranks without profile enrichment if needed. Graceful degradation, not failure.

Idempotency Store Loss

If the Redis SET holding idempotency keys is lost, duplicate score updates get applied. Fallback: Write Service checks PostgreSQL score_events table for the event_id before applying. Slower (~5ms vs 0.1ms) but correct. Rebuild idempotency keys from recent events via bulk SADD.

Cheater Injection at Scale

Fraudulent scores from compromised game servers. Defense layers: IP-restricted service tokens, per-account anomaly detection (score velocity, geographic impossibility), and surgical rollback — reverse exactly the fraudulent events from the event log without affecting legitimate players.

Full Redis Cluster Loss (Disaster Recovery)

Recovery: 15-45 minutes. Serve stale CDN cache for top-K. Provision new cluster. Rebuild ZSETs from PostgreSQL user_scores (bulk ZADD ~1M entries/sec). Replay recent events from score_events. Faster path: restore from hourly RDB snapshot in S3 → recovery in 5-10 minutes.

FailureDetectionRecoveryData Risk
Redis master crash5 sec5-10 sec0-100ms writes
Score drift15 minAuto-fixNone (PG is truth)
Hot partitionSeconds30-60 sec autoscaleNone
Idempotency lossMinutesSeconds (fallback)Possible duplicates
Cheater injectionMin-hoursMinutes (rollback)Integrity corruption
Full cluster lossImmediate15-45 minLast seconds
09

Interview Tips

💡
Start with the two queries, not the architecture.
Clarify first: does the system need top-K only, or top-K plus arbitrary rank lookup? Top-K alone can use a min-heap or periodic sort. Rank lookup is what forces you toward skip lists. Frame Redis ZSET as a consequence of the problem, not a starting assumption.
Derive scale, don't declare it.
Bad: "Let's say 10K writes/sec." Good: "20M DAU × 6 matches × 20-min avg = 120M updates/day = 1,400/sec avg, 5,600/sec peak." Every number connects to the previous through a chain of defensible assumptions.
🎯
Explain why PostgreSQL alone doesn't work before introducing Redis.
Walk through: "COUNT(*) WHERE score > X on 100M rows with a B-Tree index ≈ 200ms. Budget is 50ms. That gap is what pushes us to an in-memory skip list." Then add: "Under 1M users, PostgreSQL is absolutely fine — Redis is only necessary at scale."
🏆
The tiebreaker question is your chance to shine.
Crisp answer: "Pack score × 10⁷ + (MAX_TS − actual_ts) into one float64. Earliest timestamp = higher composite for equal scores. Redis sorts natively. IEEE 754 gives 52 bits of mantissa — safely within precision for scores up to 10⁸."
📐
Know your sharding boundary and be honest.
"This works for 100M users. Beyond that, the ZSET can't fit on one shard. I'd move to tiered ranking — exact for top 1M, percentile for the rest. The API already returns percentile, so it's backward-compatible." Knowing where your design breaks is a sign of maturity.
🛡️
Don't forget anti-cheat — it separates senior from mid-level.
Scores from game servers only (never client-direct). Anomaly detection on event stream. Surgical rollback via event log. Rate limiting on score submission endpoints. This shifts the conversation from infrastructure to product thinking.
⚠️
Common mistakes to avoid.
• Saying "Redis handles everything" without proving it with throughput/memory numbers.
• Not addressing what happens when Redis goes down (always mention PostgreSQL backing store).
• Pre-computing friends leaderboards (walk through the 200× write amplification math).
• Over-engineering consistency — eventual is genuinely fine for leaderboards.
11

Evolution

How this design grows from MVP to planet-scale. Each stage is triggered by a specific, measurable constraint — not by ambition.

1

MVP — Single Server, Single Redis

< 1M users, < 500 writes/sec. One API server, one Redis instance (2 GB), one PostgreSQL. Can even skip Redis and use PostgreSQL COUNT(*) directly — returns in 5-10ms at this scale. One engineer, one week to build.

2

Production — Read/Write Split, Replication, CDN

< 50M users, < 5K writes/sec. Separate Read and Write services. Redis master-replica for failover. CDN caches top-K (5s TTL). Kafka for async fan-out to secondary leaderboards. Idempotency layer + reconciliation worker. This is where most companies stop — it's enough.

3

Scale — Redis Cluster, Batch Ingestion

< 100M users, ~60 GB Redis. 4-shard Redis Cluster for memory capacity (not throughput). Batch score endpoint with Redis pipelines. Profile enrichment as a separate service. Composite score packing for tiebreaking. Admin moderation endpoints. This is our target architecture.

4

Large Scale — Tiered Ranking, Multi-Region

< 1B users, < 50K writes/sec. Top-1M in a small ZSET (exact rank), rest via score histogram (percentile). Multi-region deployment: writes to primary region, cross-region Kafka replication, local reads. Kafka as write buffer (202 Accepted, ~100-500ms rank lag).

5

Planet Scale — Streaming Top-K, OLAP

1B+ users, 100K+ writes/sec. Space-Saving algorithm for approximate top-K with O(1) updates. OLAP engine (Druid, ClickHouse) for multi-dimensional filtered rankings. Exact rank only for top 1K. Everyone else gets percentile. This is Spotify Wrapped / Xbox Live territory.

Next up