"How many unique visitors did we have today?" Naive: keep a Set<user_id>. For 100M users, that's 6+ GB of RAM. Do it per-URL or per-minute and you're out of memory fast. HyperLogLog answers the same question with 12 KB, 99% accuracy. Count-Min Sketch estimates "how often was element X seen?" in similar space. Top-K finds the most frequent items without tracking everyone.
These approximate data structures are the only way to do cardinality, frequency, and ranking analytics at scale. Used by Google, Cloudflare, Redis, Postgres, every ad-tech platform.
02
HyperLogLog — intuition
Key insight: if you hash items to a uniform random bit string, the maximum number of leading zeros you've seen gives a probabilistic estimate of how many unique items you've hashed. Seen a hash with 10 leading zeros? You've probably seen ~2¹⁰ = 1024 distinct items (since the probability of a random hash starting with 10 zeros is 1/1024).
Using just one counter is noisy, so HLL splits items into buckets (via the first few hash bits) and tracks max-leading-zeros per bucket. With m=16384 buckets, you get ~0.81% standard error using ~12 KB. Extraordinary accuracy/memory tradeoff.
Unions are free: HLL(A ∪ B) = element-wise max of their bucket arrays. Lets you aggregate cardinality across dimensions (per-region, per-hour) cheaply.
03
The approximate-structure zoo
Structure
Answers
Memory
Error
Bloom filter
"Is X in set?"
~10 bits per item
False positives only
HyperLogLog
"How many unique items?"
~12 KB constant
~0.8% standard error
Count-Min Sketch
"How many times was X seen?"
~width × depth × 4 bytes
Overestimates bounded by ε·N
t-Digest
"What's the P99 latency?"
~1 KB
Sub-1% on extreme percentiles
MinHash
"Jaccard similarity between sets?"
~k × 4 bytes
O(1/√k) std error
Top-K (Space-Saving)
"Top N most frequent items?"
~k × (size + counter)
Exact for items above threshold
04
Count-Min Sketch — the frequency counter
Maintain a 2D array of counters, width w × depth d. For each item X:
Hash X with d different hash functions → d bucket indices.
Increment the counter at each (row, bucket).
To query frequency of X: hash with same d functions, read d counters, return the minimum. The minimum is the tightest upper bound — the other rows may have been inflated by collisions.
Uses: rate limiting (per-IP request counts), heavy-hitter detection (spot the IP doing 10M RPS), ad-tech (approximate frequency capping).
HyperLogLog cardinality in 20 lines
import math, mmh3
class HLL:
def __init__(self, p=14): # p=14 → 16 KB, ~0.8% error
self.p = p
self.m = 1 << p
self.registers = [0] * self.m
def add(self, item):
h = mmh3.hash64(item)[0] & ((1 << 64) - 1)
idx = h >> (64 - self.p)
w = (h << self.p) & ((1 << 64) - 1) | (1 << (self.p - 1))
self.registers[idx] = max(self.registers[idx],
64 - w.bit_length() + 1)
def count(self):
alpha = 0.7213 / (1 + 1.079 / self.m)
z = sum(2.0 ** -r for r in self.registers)
return int(alpha * self.m * self.m / z)
# 16 KB counts cardinalities up to billions with ~0.8% error
05
Deep dive — HLL at Cloudflare scale
Cloudflare publishes daily "unique visitors per zone" stats for every domain. Billions of events per day, millions of domains. Full-fidelity counting = impossible. Their solution:
Each edge PoP maintains a HLL per (zone_id, time_bucket) in memory.
Every hour, flush to a central store that UNIONs all edge HLLs → global cardinality per zone per hour.
Storage: one HLL per zone per hour = 12 KB. For 10M zones × 24 hours = 2.8 TB/day. The same data as exact counts would require petabytes. Query latency: microseconds for any aggregation window.
Interview use: "For the count-active-users problem we'll use HyperLogLog. Each server maintains a HLL sketch for the current window; we union across the fleet every minute for the global count. Space is constant regardless of user count. 0.8% error is well within product tolerance."
06
Real-world
Redis
PFADD / PFCOUNT / PFMERGE
Native HLL data type. 12 KB per key. Union multiple keys for cross-dimension aggregates.
Postgres + hll extension
Cardinality in SQL
Precompute HLLs per partition; queries union them at query time. Citus uses this heavily.
Google BigQuery / Snowflake
APPROX_COUNT_DISTINCT
Both default to HLL-based approximation for count-distinct at warehouse scale.
Druid / ClickHouse
HLL + t-Digest native
OLAP engines bake in approximate sketches for fast analytical queries over billions of rows.
07
Used in problems
Count active users is the canonical HLL problem. Leaderboard uses Top-K sketches for high-cardinality ranking. Recommendation algorithm uses MinHash for item-similarity at scale.