System Design — 016

Count Unique Active Users

Design a system to count DAU, WAU, and MAU at massive scale — balancing exact accuracy for financial reporting against real-time approximate answers for dashboards, using probabilistic data structures.

HyperLogLogLambda ArchitectureStream ProcessingProbabilistic Counting
01

Problem Statement

"How many unique users were active on our platform yesterday?" A seemingly simple question that becomes an engineering challenge at scale. At 200M DAU generating 6 billion events/day, a naive SELECT COUNT(DISTINCT user_id) would scan billions of rows and take your database down.

"Active" must be defined — we use any user who triggered at least one meaningful event (page view, click, stream, search). "Unique" must be deduplicated across multiple devices, sessions, and servers. Time windows (DAU/WAU/MAU) must be composable — MAU is not the sum of 30 DAUs.

The core tension: exact counting requires O(n) memory — 150 TB for all segments at scale. HyperLogLog reduces this to 108 MB with 0.81% error. The architecture must serve both paths.

Core question: How do you count unique active users across multiple time windows, at massive scale, with tunable accuracy-vs-cost tradeoffs, serving both real-time dashboards and exact batch reports?

02

Requirements

Functional Requirements

  • Count unique users within a time window — DAU, WAU, MAU, or any custom date range with deduplication across devices and sessions
  • Support multiple granularities — hourly, daily, weekly, monthly; and arbitrary ranges like "March 10–25"
  • Segment by dimensions — country, platform (iOS/Android/Web), feature, user cohort — each multiplying the counter count
  • Expose via API and dashboard — REST API for programmatic access (ad pricing, alerts) and real-time visual dashboard for PMs/execs
  • Historical trend queries — "Show DAU trend for last 90 days" or "Compare this WAU to same week last year"

Non-Functional Requirements

  • Scale: Ingest 6B events/day (~70K/sec avg, 210K/sec peak) from 200M DAU
  • Low-latency reads: Dashboard queries return in < 1 second via pre-computed results
  • Tunable accuracy: Approximate (HLL, ±0.81%) for dashboards; exact (batch) for financial/investor reports
  • Fault tolerance: No event loss — Kafka replication factor 3, Flink checkpointing, idempotent producers
  • Idempotency: Duplicate events must not inflate counts — HLL naturally deduplicates by user_id
  • Freshness: Real-time path reflects activity within 1–5 minutes
  • Cost efficiency: Memory must scale sub-linearly — 150 TB exact vs 108 MB HLL for all segments

Key NFR conflict: Accuracy vs Freshness is irreconcilable in a single path — this forces the Lambda Architecture with separate real-time (approximate) and batch (exact) pipelines.

03

Scale Estimation

Every number is derived from stated assumptions. The DAU/MAU ratio of 40% and 30 events/user/day are industry benchmarks for engagement-driven platforms.

200M
Daily Active Users
70K/sec
Avg Events Ingested
210K/sec
Peak Events
3 TB/day
Raw Event Storage
108 MB
HLL Memory (all segments)
150 TB
Exact Set Memory
~1 PB/yr
Annual Event Archive
1.4M×
Memory Reduction (HLL)

Derivation

500M MAU × 40% DAU/MAU = 200M DAU. Each active user generates ~30 events/day → 6B events/day ÷ 86,400 = ~70K/sec (peak 3× = 210K/sec). At 500 bytes/event → 3 TB/day raw storage. HLL: 3 windows × 3,000 segments × 12 KB = 108 MB. Exact: 200M users × 48 bytes × 3 windows × 3,000 segments = 150 TB. The 1.4-million-fold gap forces probabilistic counting for real-time.

MetricValueArchitectural Signal
Kafka brokers6+ (RF=3)~10 TB/broker for 7-day retention
Flink task slots7–10~30K events/sec per slot
Redis memory~25 GB30 days of hourly HLL sketches
Spark batch (DAU)3 TB scan~2 hour job window
Spark batch (MAU)90 TB scanIncremental computation required
04

API Design

Event Ingestion — POST /v1/events
{
  "events": [{
    "event_id":     "evt_8f3a2b91c4d7",   // Client UUID — idempotency key
    "user_id":      "usr_204871",          // Canonical authenticated user ID
    "anonymous_id": "anon_d84f2c9e",       // Pre-login device ID
    "event_type":   "page_view",           // page_view | click | search | stream
    "timestamp":    "2026-04-06T14:23:01Z",// Client-side event time
    "properties": {
      "platform": "ios",
      "country":  "IN",
      "feature":  "search",
      "session_id": "sess_7a2b3c"
    }
  }]
}
// Response: 202 Accepted (async — written to Kafka, not yet counted)
// { "accepted": 98, "rejected": 2, "errors": [...] }

Events are batched (50–100 per request) to reduce TCP overhead from 70K to ~1K requests/sec. 202 Accepted (not 200 OK) because processing is async. Client-generated event_id enables idempotent retry. Both user_id and anonymous_id support identity stitching across devices.

Query Unique Counts — GET /v1/metrics/unique-users
GET /v1/metrics/unique-users
  ?metric=dau
  &start=2026-04-01&end=2026-04-06
  &granularity=daily
  &filters[country]=IN,US
  &filters[platform]=ios
  &mode=approximate        // approximate (HLL) | exact (batch)
  &compare_to=previous_period

// Response:
{
  "mode": "approximate",
  "accuracy": { "standard_error": "0.81%", "method": "hyperloglog" },
  "data": [
    { "date": "2026-04-01", "unique_users": 14832901,
      "comparison": { "date": "2026-03-25", "change_pct": "+4.38%" } },
    ...
  ],
  "freshness_lag": "2m 14s"
}

The mode parameter exposes the Lambda dual-path directly. accuracy metadata lets consumers decide if precision is sufficient. freshness_lag shows pipeline delay. The granularity parameter leverages HLL's mergeability — only hourly sketches are stored; daily/weekly/monthly are derived via PFMERGE.

Anomaly Alerts — POST /v1/alerts/rules
{
  "name": "DAU drop alert",
  "metric": "dau",
  "condition": {
    "type": "percentage_change",
    "threshold": -15,
    "compared_to": "same_day_last_week",
    "window": "1h"
  },
  "notify": { "webhook": "https://hooks.slack.com/..." }
}

Alerting is first-class because silent data loss — the most dangerous failure — produces plausible but wrong numbers that nobody investigates unless automated detection catches it.

05

High-Level Architecture

A Lambda Architecture with three paths: real-time (speed), batch (truth), and serving (unified query). Every component exists because a specific scale constraint demanded it.

Client SDKs iOS / Android / Web Ingestion API Validate + Enrich Kafka user-events · 64 parts Identity Stitch anon_id → user_id Apache Flink HLL Sketches · 108 MB Redis Cluster PFADD / PFMERGE S3 Archive Parquet · 3 TB/day Apache Spark Exact COUNT DISTINCT PostgreSQL Exact Counts Query API approx | exact Dashboard PM / Exec / Ads Anomaly Detect Canary + Alerts HTTPS Produce Consume Push HLL Read Parquet Write Approx Exact Speed Layer (real-time) Batch + Serving Layer Monitoring
Request Flow — Step Through
Client SDKIngestion APIKafkaFlink (HLL)RedisS3 ArchiveSpark (Batch)Query APIDashboard
Click Next Step to walk through the request flow.
06

Deep Dive — HyperLogLog vs Exact Counting

The defining data structure choice. HyperLogLog lets you count 500M unique elements in 12 KB of memory with 0.81% standard error — a 1.4-million-fold reduction vs exact sets. Here's why it works.

The Coin-Flip Intuition

Imagine each user flips a coin until they get heads, then reports the number of tails. You only track the maximum tails reported. If the max is R, you've probably seen ~2R unique people. A run of 20 tails (probability 1/220 ≈ 1 in a million) means ~1M unique visitors. One number, O(1) memory.

In practice: hash each user_id to a binary string, count leading zeros (equivalent to tails before heads). A single estimator has ~70% error. HLL fixes this with stochastic averaging — split elements into 16,384 buckets (using the first 14 bits of the hash), track max leading zeros per bucket, and combine using the harmonic mean.

Memory

16,384 registers × 6 bits each = 12 KB. Constant regardless of cardinality — works for 100 or 500M elements.

Error

Standard error = 1.04 / √16384 = 0.81%. For 200M DAU: ±1.62M (95% confidence ±3.24M).

The Killer Feature: Mergeability

Two HLL sketches merge by taking the register-wise maximum. The merged sketch equals a single sketch built over the union of both sets, with identical error bounds. This enables:

  • Hourly → Daily: Merge 24 hourly sketches → DAU
  • Daily → Monthly: Merge 30 daily sketches → MAU
  • Cross-partition: Merge 10 Flink task slot sketches → global count
  • Arbitrary ranges: "Users from March 10–25" → merge 16 daily sketches

Limitation: HLL supports union (merge) but NOT intersection or difference. "Users on iOS AND Android" requires Theta Sketches (Apache DataSketches) which support all set operations at ~32 KB, 2.7× more memory.

sequenceDiagram participant SDK as Client SDK participant API as Ingestion API participant K as Kafka participant F as Flink participant R as Redis SDK->>API: POST /v1/events (batch of 50) API->>K: Produce to partition hash(user_id) % 64 K->>F: Consume event F->>F: hash(user_id) → bucket #6746 F->>F: Leading zeros = 5 → Register[6746] = max(old, 5) Note over F: Repeat for each segment:
global, country:IN, platform:ios F->>R: PFMERGE hourly HLL sketches (every 30s) Note over R: PFCOUNT → approximate DAU
PFMERGE 24 hourly → daily

Comparison Table

StructureMemory (200M, 1 ctr)Memory (9K ctrs)AccuracyMergeIntersect
HashSet9.6 GB150 TBExactYes (slow)Yes
Roaring Bitmap200 MB–1 GB~9 TBExactYes (fast)Yes
HyperLogLog12 KB108 MB±0.81%Yes (trivial)No
Theta Sketch32 KB288 MB±1.4%YesYes
Bloom Filter250 MB2.25 TBN/A (not a counter)Yes≈ Yes

Redis HLL in Practice

# Add user to hourly sketch
PFADD hll:hourly:2026040614:country:IN "usr_204871"
→ (integer) 1   # sketch modified — new element

# Same user again — no change
PFADD hll:hourly:2026040614:country:IN "usr_204871"
→ (integer) 0   # already counted!

# Merge 24 hourly sketches → daily
PFMERGE hll:daily:20260406:country:IN \
    hll:hourly:2026040600:country:IN \
    ... \
    hll:hourly:2026040623:country:IN

# Get DAU
PFCOUNT hll:daily:20260406:country:IN
→ (integer) 14832901   # ±0.81%
07

Key Design Decisions & Tradeoffs

Lambda vs Kappa Architecture

✓ Chosen

Lambda (Dual-Path)

Separate real-time approximate (Flink + HLL → Redis) and batch exact (Spark → PostgreSQL) paths. Serves both dashboard and finance needs.

✗ Alternative

Kappa (Single Stream)

Single Flink pipeline for everything. Simpler ops, but can't serve exact counts without O(n) memory. Works if approximate is always acceptable.

HyperLogLog vs Theta Sketch

✓ Chosen

HLL (Primary) + Theta (Secondary)

HLL for DAU/WAU/MAU — Redis native PFADD/PFCOUNT, 12 KB, 0.81% error. Theta Sketches for intersection queries only. Best of both worlds.

✗ Alternative

Theta Sketch Everywhere

Supports union + intersection + difference natively. But 2.7× more memory (32 KB), no Redis native support, more complex internals.

Event-Time vs Processing-Time Windows

✓ Chosen

Event-Time (Client Timestamp)

Correctly assigns offline/late events to their actual day. Subway user's 11:55 PM events count as today, not tomorrow. Requires watermarks + allowed lateness.

✗ Alternative

Processing-Time (Server Clock)

Simpler — no watermarks needed. But systematically miscounts late-arriving events. Offline mobile users inflate next-day DAU.

Kafka Partition Key

✓ Chosen

Partition by user_id

All events for a user land on the same partition → local deduplication in Flink. Enables per-user logic. Risk: hot partitions from celebrity/bot accounts.

✗ Alternative

Random (Round-Robin)

Perfect load balancing, no hot partitions. But user events scatter across partitions — can't do per-user logic without global coordination.

Redis Update Model

✓ Chosen

Push from Flink (Batched)

Flink pushes merged HLL sketches to Redis every 30–60s. Redis receives ~2 ops/sec instead of 210K. Adds 30–60s latency.

✗ Alternative

Direct PFADD per Event

Minimal latency. But 210K PFADD/sec at peak pushes Redis single-threaded limits. Requires many shards and complex routing.

Identity Resolution Strategy

✓ Chosen

Best-Effort Real-Time + Batch Reconciliation

Real-time maps known anonymous IDs via Redis lookup. Batch path does full retroactive stitching. Accepts 1–3% overcounting in real-time.

✗ Alternative

Real-Time Only

Simpler pipeline. But unresolved anonymous IDs permanently overcount. No correction path. Error compounds over time.

08

What Can Go Wrong

🔴 Silent Data Loss — The #1 Threat

No crash, no error, but events silently stop flowing. A broken SDK, misconfigured load balancer, or Kafka retention misconfiguration causes DAU to drift down without alerts. Defense: End-to-end canary events (synthetic test users verified at the output), volume-based anomaly detection per platform/country, cross-path consistency checks (real-time vs batch divergence > 3% triggers investigation).

Kafka Broker Crash

Leader partitions become unavailable for 5–30s during election. At 1,100 events/sec per partition, ~55K–330K events queue during failover. Defense: Replication factor 3 ensures no data loss. Idempotent producer (enable.idempotence=true) prevents duplicates on retry. Auto-recovery via leader election.

Flink Task Slot Crash

In-memory HLL state for current hour is lost. ~10% of events stop flowing into sketches. Defense: Checkpointing to S3 every 1–2 minutes. On crash, Flink restores from checkpoint and replays events from Kafka. HLL idempotency means replayed events don't double-count. Recovery time: 30–120 seconds.

Redis Cluster Failure

Dashboard queries fail. No permanent data loss (Flink still processes and checkpoints). Defense: Redis replicas with auto-failover (< 30s). Query API falls back to batch results from PostgreSQL with "data_source": "batch_fallback" flag. Flink buffers HLL updates and replays on Redis recovery.

Identity Stitching Failure

Service returns stale mappings or times out. Users counted under both anonymous and authenticated IDs → DAU inflated 5–15%. Dangerous because DAU going UP looks like good news. Defense: Circuit breaker at 5% error rate. Fallback to user_id-only counting (undercounts, which is safer). Batch reconciliation catches within 24h.

Clock Skew / Timestamp Corruption

Misconfigured device clocks send events timestamped in 2024 or 2030. With event-time windowing, these corrupt wrong time windows. Defense: Reject events where |client_timestamp - server_received_at| > 7 days. Watermark allowed lateness of 24h. Monitor clock skew distribution — alert if >1% of events have skew > 1 hour.

Spark Batch Job Failure

Executor OOM, S3 timeout, or logic bug produces incorrect exact counts. Defense: Sanity checks before commit: DAU within 20% of previous day, within 5% of HLL estimate, MAU ≥ WAU ≥ DAU. Failed checks write to staging table, not production. Immutable Parquet input enables safe re-runs.

09

Interview Tips

💡
Lead with "What does active mean?"
This clarifying question signals seniority. The definition (login vs event vs session duration) changes the entire data pipeline. Junior candidates jump straight to HyperLogLog; senior candidates nail the requirements first.
Derive HLL from the memory calculation
Proactively compute: "200M users × 48 bytes = 9.6 GB per set. With 3,000 segments × 3 windows = 150 TB. HLL brings this to 108 MB — a 1.4M× reduction at 0.81% error." This back-of-envelope is the most impressive moment in the interview.
🎯
Name Lambda, then critique it
Say: "This is a Lambda Architecture." Then immediately: "The cost is two codepaths and divergent counts. If exact isn't needed, I'd simplify to Kappa." Naming + critiquing + offering alternatives = senior-level signal.
🔑
HLL mergeability is your trump card
"I store hourly HLL sketches because HLL merge is lossless — 24 hourly sketches merge to DAU, 720 to MAU, with identical accuracy. Only hourly granularity is computed; all coarser windows are derived."
🛡️
Preempt the intersection limitation
Before the interviewer asks: "HLL only supports union, not intersection. For 'users on iOS AND Android,' I'd use Theta Sketches from Apache DataSketches — 32 KB, support union + intersect + difference."
🚨
Silent data loss is your differentiator
"The most dangerous failure isn't a crash — it's silent data loss. A broken SDK drops 15% of events and nobody notices. My defense: end-to-end canary events from synthetic test users, verified at the output."
11

Evolution

How this design grows from MVP to planet-scale. Add complexity only when a specific constraint forces it.

1

MVP — PostgreSQL + Cron (0–1M DAU)

SELECT COUNT(DISTINCT user_id) FROM events WHERE timestamp >= CURRENT_DATE. Runs hourly on the production DB. Simple, exact, and fast enough. Breaks when analytics queries compete with production traffic.

2

Read Replica + Redis HLL (1M–10M DAU)

Dedicated read replica for analytics. Redis PFADD per event for real-time approximate DAU on the dashboard. First introduction of HyperLogLog. Total HLL memory: a few hundred KB.

3

Streaming Pipeline (10M–50M DAU)

Kafka + Flink replaces direct Redis writes. S3 archival + weekly Spark job for exact counts. Birth of Lambda Architecture. First "dashboard says X, report says Y" moment — add mode=approximate|exact API parameter.

4

Full Production (50M–500M DAU)

Identity stitching service. 3,000+ segments. Anomaly detection with canary events. Redis Cluster for 25 GB of hourly HLL sketches. Incremental MAU computation with Roaring Bitmaps. Cross-path consistency checks.

5

Planet Scale — Multi-Region (500M+ DAU)

Regional Kafka + Flink clusters. Global HLL merge aggregator (associative + commutative — order doesn't matter). Tiered storage (S3 Standard → Glacier). HLL precision tuning per segment. Evolving toward streaming-first with batch as audit-only.

Next up