Top-K Leaderboard

01

Problem Statement

Design a system that maintains and serves real-time leaderboards — think gaming leaderboards (Fortnite, PUBG), competitive platforms (LeetCode, Kaggle), or engagement rankings (Spotify Wrapped). Users perform actions that update their scores, and the system must efficiently answer "who are the top K?" and "what's my rank?" at any point.

The naive approach — SELECT * ORDER BY score DESC — works for 1,000 users. At 100 million users, re-sorting on every update is a non-starter. This is a data structure problem disguised as a system design problem. The choice of data structure determines your latency, memory footprint, and sharding strategy.

Core question: How do you maintain a globally sorted ranking of millions of entities that updates in real-time, while serving both top-K queries and individual rank lookups with low latency?

The Two Queries That Drive the Design

Top-K: "Give me the top 100 players." A range query on the head of a sorted set — relatively easy. Most sorted data structures answer this efficiently.

My Rank: "What rank am I?" This is the hard one. It asks "how many entities have a score higher than mine?" — a full table scan in a flat database. This single query is what separates a toy solution from a real one, and it pushes the entire design toward a skip list (Redis ZSET).

02

Requirements

Functional Requirements

Update a user's score — supports three modes: increment (add to current), replace (overwrite), and max (keep highest). Increment is the hardest — needs atomic read-modify-write.
Get Top-K users — return the K highest-scoring users with scores, real-time, with timestamp-based tiebreaking (earliest = higher rank for same score).
Get a user's rank — O(log N) rank lookup with optional surrounding window (show 5 users above and below).
Scoped leaderboards — support global, daily, weekly, monthly, regional, and tournament-specific leaderboards as independent sorted sets.
Profile enrichment — leaderboard engine stores (user_id, score) only. API layer enriches with display names, avatars, tier badges from a separate Profile Service.

Non-Functional Requirements

Low read latency — p99 < 50ms for top-K and rank queries. Data structure must live in memory.
High write throughput — tens of thousands of score updates/sec. Increment must be atomic (Redis ZINCRBY).
Eventual consistency on ranks — target lag < 5 seconds. Strong consistency on score accuracy — no lost updates.
High availability (99.9%+) — Redis Sentinel for failover. PostgreSQL as durable backing store for rebuild.
Scalability to 100M+ users — single ZSET fits on one shard (~11 GB). Redis Cluster for total memory capacity.
Idempotent writes — every score update carries a unique event ID. Duplicates are detected and skipped.

Key tension: Low read latency (in-memory) conflicts with score durability (disk-based). Resolution: Redis as the fast path, PostgreSQL as the safety net. Both are written on every update.

03

Scale Estimation

Anchored on a competitive gaming platform at Fortnite/PUBG scale. Every number is derived from the previous one.

100M

Registered Users

20M

DAU (20%)

5,600/s

Peak Writes

8,400/s

Peak Reads

1.5 : 1

Read:Write Ratio

~60 GB

Total Redis Memory

~1,100

Active Leaderboards

4 shards

Redis Cluster Size

Write Path Derivation

20M DAU × 6 matches/day ÷ 86,400 sec = 1,400 writes/sec average. Peak (4×) = 5,600/sec. Event spikes (10×) = ~14,000/sec. A single Redis instance handles 150K+ ops/sec — throughput is not our bottleneck.

Memory Derivation

Each ZSET entry ≈ 110 bytes (skip list node + hash table entry + 16-byte UUID member). Global leaderboard: 100M × 110B = 11 GB. All scopes combined: ~60 GB. Doesn't fit on one node → Redis Cluster with 4 shards × 32 GB = 128 GB total, using ~47%.

Critical Insight

The read:write ratio is nearly balanced (1.5:1) — unusual for most systems. This means write optimization matters as much as read. We can't just "cache everything and forget about writes." However, top-K is highly cacheable (same for all users), while rank lookups must hit Redis directly.

04

API Design

Update Score

        POST /v1/leaderboards/{leaderboard_id}/scores
Headers: Idempotency-Key: <unique_event_id>      // REQUIRED

{
  "user_id":  "usr_a1b2c3d4",
  "score":    150,
  "mode":     "increment"     // "increment" | "replace" | "max"
}

→ 200 { "new_score": 2450, "previous_score": 2300,
        "idempotent_replay": false }
      

mode maps to Redis commands: increment → ZINCRBY, replace → ZADD, max → ZADD GT. The response does not include rank — computing rank is a separate O(log N) operation that would slow the write path.

Get Top-K

        GET /v1/leaderboards/{leaderboard_id}/top?limit=100&cursor=null

→ 200 {
  "entries": [
    { "rank": 1, "user_id": "usr_x9y8z7", "score": 98750,
      "user_profile": { "display_name": "ShadowKnight", ... } },
    ...
  ],
  "next_cursor": "eyJvZmZzZXQiOjEwMH0=",
  "total_users": 43829104
}
      

Uses cursor-based pagination, not offset. Offset on a live sorted set causes skips/duplicates between pages. The cursor encodes the last score+user_id seen. Highly cacheable — CDN with 5-second TTL absorbs 99.99% of top-K traffic.

Get User Rank

        GET /v1/leaderboards/{leaderboard_id}/users/{user_id}/rank?window=5

→ 200 {
  "rank": 48231, "score": 2450, "percentile": 99.89,
  "window": {
    "above": [ { "rank": 48229, "score": 2455 }, ... ],
    "below": [ { "rank": 48232, "score": 2448 }, ... ]
  }
}
      

Internally: ZREVRANK (O(log N)) + ZREVRANGE for the window. Percentile is derived: (total - rank) / total × 100. Not cacheable — per-user, must hit Redis directly.

Batch Score Update (Server-to-Server)

        POST /v1/leaderboards/{leaderboard_id}/scores/batch
{ "updates": [
    { "user_id": "usr_a1b2", "score": 150, "mode": "increment" },
    { "user_id": "usr_e5f6", "score": 200, "mode": "increment" },
    ...   // up to 1000 per batch
] }

→ 200 { "processed": 100, "results": [...] }
      

A single match produces 100 score updates. Redis pipeline sends all ZINCRBY commands in one round trip — sub-millisecond on Redis side.

Endpoint	Redis Op	Latency	Cacheable
Update Score	ZINCRBY / ZADD	< 5ms	No
Get Top-K	ZREVRANGE	< 10ms	Yes (5s TTL)
Get User Rank	ZREVRANK + ZREVRANGE	< 10ms	No
Batch Update	Pipeline ZINCRBY	< 20ms	No
Remove User	ZREM	< 5ms	No

05

High-Level Architecture

Built on CQRS-lite: separate Read and Write services, Redis as the fast ranking engine, PostgreSQL as the durable source of truth. Each leaderboard is one ZSET on one shard — no cross-shard queries.

Key Architectural Principles

1. Redis is the fast path, PostgreSQL is the safety net. If Redis dies, we rebuild from PostgreSQL in minutes. If PostgreSQL dies, rankings keep working from Redis.

2. One leaderboard = one ZSET = one shard. No cross-shard queries. Every ranking operation is a single-node operation.

3. Cache the head, query the tail. Top-K is identical for everyone — cache aggressively. Rank lookups are per-user — serve live from Redis.

4. Async fan-out for secondary leaderboards. Primary leaderboard updated synchronously. Daily/weekly/regional updated via Kafka consumers.

Request Flow — Step Through

Game Client→API Gateway→Write Service→Idempotency Check→PostgreSQL→Redis ZSET→Kafka→Consumer Workers

Click Next Step to walk through the request flow.

06

Deep Dive — Real-Time Ranking at Scale

The entire leaderboard design rests on one data structure: the skip list inside Redis ZSET. Let's understand exactly why it's perfect for ranking, how it achieves O(log N) on all operations, and what happens when it's not enough.

Why Ranking Is a Data Structure Problem

You have 100M scores. You need two queries: top-K and "what's my rank?" A flat array needs O(n log n) sort for top-K and O(n) scan for rank. A B-Tree (database index) can do top-K efficiently but rank still requires a range scan — the B-Tree doesn't store subtree sizes. The answer: a data structure that maintains sorted order incrementally with position tracking.

The Skip List — How Redis ZSET Works

A skip list is a layered linked list. Level 0 is a sorted linked list of all elements. Higher levels are "express lanes" that skip over multiple elements. Think of it like a subway — local train (level 0), express (level 1), super-express (level 2). Search starts at the highest level and drops down.

sequenceDiagram participant C as Game Client participant GW as API Gateway participant WS as Write Service participant R as Redis ZSET participant PG as PostgreSQL participant K as Kafka participant CW as Consumer Worker C->>GW: POST /scores (score=150) GW->>WS: Route (authenticated) WS->>WS: Check idempotency key WS->>PG: INSERT score_event PG-->>WS: OK (durable) WS->>R: ZINCRBY lb:global 150 user_a R-->>WS: new_score = 2450 WS->>K: Publish score.updated WS-->>C: 200 { new_score: 2450 } K->>CW: Consume event CW->>R: ZINCRBY lb:weekly 150 user_a CW->>R: ZINCRBY lb:daily 150 user_a

The Key Augmentation — Span Tracking

Each forward pointer in the skip list stores a span — the number of elements it skips over. To find rank, traverse from HEAD at the highest level, accumulating spans as you go. For 100M elements with 32 levels, that's ~27 hops. Sub-microsecond on modern hardware.

ZSET = Skip List + Hash Table

The skip list gives O(log N) sorted operations. The hash table maps member → (score, node pointer) for O(1) score lookup. Together: ZADD O(log N), ZSCORE O(1), ZREVRANK O(log N), ZREVRANGE O(log N + K).

The Tiebreaker Trick — Score Packing

Redis sorts by one dimension. For tiebreaking, pack score + inverted timestamp into one float64:

        composite = score × 10⁷ + (MAX_TIMESTAMP − actual_timestamp)

// Alice scores 2450 at t=1712345600 → 24502390099200
// Bob   scores 2450 at t=1712345700 → 24502390099100
// Alice's composite is higher → Alice ranks above Bob ✓

// Extract raw score: floor(composite / 10⁷) = 2450
      

IEEE 754 doubles have 52 bits of mantissa — integer precision up to 9 × 10¹⁵. Our composite uses at most 15 digits. Safely within range.

When One ZSET Isn't Enough — Sharding Strategies

A. Score-Range Sharding

Shard by score range. Rank = ZCARD of higher shards + rank within own shard. Fast rank queries. Uneven distribution, cross-shard migration on score change.

B. Hash-Based Sharding

Hash users across shards. Top-K = query all shards, merge. Rank = ZCOUNT on all shards. Even distribution. Every rank query fans out to all shards.

C. Tiered Approach (Production Answer)

Tier 1: ZSET of top 1M users (~110 MB). Exact ranking, fast, unsharded.
Tier 2: Score histogram (Fenwick tree) for all users. "How many users have score > X?" as a prefix sum in O(log B).
Result: exact rank for top players, approximate percentile for the rest. This is how Xbox Live and Steam work at scale.

The Decision Tree

Users	Ranking Engine	Precision
< 1M	PostgreSQL with B-Tree index	Exact (COUNT query fast enough)
1M – 100M	Single Redis ZSET	Exact for all users
100M – 1B	Tiered: ZSET + histogram	Exact top-1M, percentile rest
1B+	Streaming top-K (Space-Saving)	Exact top-1K, approximate rest

07

Store raw score. Fetch tied users' timestamps from PostgreSQL at query time. Adds latency, complexity, and a DB dependency on the read path.

08

What Can Go Wrong

Redis Master Crashes

Downtime: 5-10 seconds. Redis Sentinel promotes the replica. Async replication means 0-100ms of writes may be lost — but PostgreSQL has them. Reconciliation worker fixes drift within 15 minutes. Mitigation: use WAIT 1 100 for semi-sync replication to eliminate the data loss window.

Score Drift Between Redis and PostgreSQL

Caused by partial write failures, replication lag during failover, or consumer bugs. Reconciliation worker samples 10,000 users every 15 minutes, overwrites Redis with PostgreSQL values on mismatch. For high-stakes queries (user checking own rank), the Read Service can do an inline validation — compare Redis score with PostgreSQL, trigger immediate correction if they differ (+5ms).

Hot Partition — Viral Leaderboard Spike

A streamer with 5M followers says "check the leaderboard!" — 500K rank queries in 10 seconds. CDN absorbs top-K reads. Redis handles 50K ZREVRANK ops/sec (within capacity). Read Service auto-scales. Profile Service circuit breaker opens — serve ranks without profile enrichment if needed. Graceful degradation, not failure.

Idempotency Store Loss

If the Redis SET holding idempotency keys is lost, duplicate score updates get applied. Fallback: Write Service checks PostgreSQL score_events table for the event_id before applying. Slower (~5ms vs 0.1ms) but correct. Rebuild idempotency keys from recent events via bulk SADD.

Cheater Injection at Scale

Fraudulent scores from compromised game servers. Defense layers: IP-restricted service tokens, per-account anomaly detection (score velocity, geographic impossibility), and surgical rollback — reverse exactly the fraudulent events from the event log without affecting legitimate players.

Full Redis Cluster Loss (Disaster Recovery)

Recovery: 15-45 minutes. Serve stale CDN cache for top-K. Provision new cluster. Rebuild ZSETs from PostgreSQL user_scores (bulk ZADD ~1M entries/sec). Replay recent events from score_events. Faster path: restore from hourly RDB snapshot in S3 → recovery in 5-10 minutes.

Failure	Detection	Recovery	Data Risk
Redis master crash	5 sec	5-10 sec	0-100ms writes
Score drift	15 min	Auto-fix	None (PG is truth)
Hot partition	Seconds	30-60 sec autoscale	None
Idempotency loss	Minutes	Seconds (fallback)	Possible duplicates
Cheater injection	Min-hours	Minutes (rollback)	Integrity corruption
Full cluster loss	Immediate	15-45 min	Last seconds

09

Interview Tips

💡

Start with the two queries, not the architecture.
Clarify first: does the system need top-K only, or top-K plus arbitrary rank lookup? Top-K alone can use a min-heap or periodic sort. Rank lookup is what forces you toward skip lists. Frame Redis ZSET as a consequence of the problem, not a starting assumption.

⚡

Derive scale, don't declare it.
Bad: "Let's say 10K writes/sec." Good: "20M DAU × 6 matches × 20-min avg = 120M updates/day = 1,400/sec avg, 5,600/sec peak." Every number connects to the previous through a chain of defensible assumptions.

🎯

Explain why PostgreSQL alone doesn't work before introducing Redis.
Walk through: "COUNT(*) WHERE score > X on 100M rows with a B-Tree index ≈ 200ms. Budget is 50ms. That gap is what pushes us to an in-memory skip list." Then add: "Under 1M users, PostgreSQL is absolutely fine — Redis is only necessary at scale."

🏆

The tiebreaker question is your chance to shine.
Crisp answer: "Pack score × 10⁷ + (MAX_TS − actual_ts) into one float64. Earliest timestamp = higher composite for equal scores. Redis sorts natively. IEEE 754 gives 52 bits of mantissa — safely within precision for scores up to 10⁸."

📐

Know your sharding boundary and be honest.
"This works for 100M users. Beyond that, the ZSET can't fit on one shard. I'd move to tiered ranking — exact for top 1M, percentile for the rest. The API already returns percentile, so it's backward-compatible." Knowing where your design breaks is a sign of maturity.

🛡️

Don't forget anti-cheat — it separates senior from mid-level.
Scores from game servers only (never client-direct). Anomaly detection on event stream. Surgical rollback via event log. Rate limiting on score submission endpoints. This shifts the conversation from infrastructure to product thinking.

⚠️

Common mistakes to avoid.
• Saying "Redis handles everything" without proving it with throughput/memory numbers.
• Not addressing what happens when Redis goes down (always mention PostgreSQL backing store).
• Pre-computing friends leaderboards (walk through the 200× write amplification math).
• Over-engineering consistency — eventual is genuinely fine for leaderboards.

10

1

MVP — Single Server, Single Redis

< 1M users, < 500 writes/sec. One API server, one Redis instance (2 GB), one PostgreSQL. Can even skip Redis and use PostgreSQL COUNT(*) directly — returns in 5-10ms at this scale. One engineer, one week to build.

2

Production — Read/Write Split, Replication, CDN

< 50M users, < 5K writes/sec. Separate Read and Write services. Redis master-replica for failover. CDN caches top-K (5s TTL). Kafka for async fan-out to secondary leaderboards. Idempotency layer + reconciliation worker. This is where most companies stop — it's enough.

3

Scale — Redis Cluster, Batch Ingestion

< 100M users, ~60 GB Redis. 4-shard Redis Cluster for memory capacity (not throughput). Batch score endpoint with Redis pipelines. Profile enrichment as a separate service. Composite score packing for tiebreaking. Admin moderation endpoints. This is our target architecture.

4

Large Scale — Tiered Ranking, Multi-Region

< 1B users, < 50K writes/sec. Top-1M in a small ZSET (exact rank), rest via score histogram (percentile). Multi-region deployment: writes to primary region, cross-region Kafka replication, local reads. Kafka as write buffer (202 Accepted, ~100-500ms rank lag).

5

Planet Scale — Streaming Top-K, OLAP

1B+ users, 100K+ writes/sec. Space-Saving algorithm for approximate top-K with O(1) updates. OLAP engine (Druid, ClickHouse) for multi-dimensional filtered rankings. Exact rank only for top 1K. Everyone else gets percentile. This is Spotify Wrapped / Xbox Live territory.

📺

References & Videos

Design a Leaderboard

Arpit Bhayani · 20 min

Redis Sorted Sets for Leaderboards

Hussein Nasser · 15 min

Design a Leaderboard

AlgoMaster

Building Real-Time Leaderboards with Redis

Redis

Problem Statement

The Two Queries That Drive the Design

Requirements

Functional Requirements

Non-Functional Requirements

Scale Estimation

Write Path Derivation

Memory Derivation

Critical Insight

API Design

High-Level Architecture

Key Architectural Principles

Deep Dive — Real-Time Ranking at Scale

Why Ranking Is a Data Structure Problem

The Skip List — How Redis ZSET Works

The Key Augmentation — Span Tracking

ZSET = Skip List + Hash Table

The Tiebreaker Trick — Score Packing

When One ZSET Isn't Enough — Sharding Strategies

A. Score-Range Sharding

B. Hash-Based Sharding

C. Tiered Approach (Production Answer)

The Decision Tree

Key Design Decisions & Tradeoffs

Ranking Engine

Redis ZSET (Skip List)

PostgreSQL B-Tree

Write Path

Synchronous Dual-Write

Event-Driven (Kafka Buffer)

Leaderboard Scoping

Independent ZSET per Scope

Computed Scoped Rankings

Rank Precision

Exact Rank for All Users

Tiered (Exact Top + Percentile)

Friends Leaderboard

Computed On Demand

Pre-Computed Per-User ZSET

Tiebreaking

Composite Score Packing

Application-Level Secondary Sort

What Can Go Wrong

Redis Master Crashes

Score Drift Between Redis and PostgreSQL

Hot Partition — Viral Leaderboard Spike

Idempotency Store Loss

Cheater Injection at Scale

Full Redis Cluster Loss (Disaster Recovery)

Interview Tips

Similar Problems

Rate Limiter

Trending Topics / Top-K Streams

News Feed Ranking

Search Autocomplete

Design a Cache (Redis)

Evolution

MVP — Single Server, Single Redis

Production — Read/Write Split, Replication, CDN

Scale — Redis Cluster, Batch Ingestion

Large Scale — Tiered Ranking, Multi-Region

Planet Scale — Streaming Top-K, OLAP

References & Videos

Rate Limiter

Trending Topics / Top-K Streams

HyperLogLog & Sketches