HyperLogLog & Sketches

01

Why this matters

"How many unique visitors did we have today?" Naive: keep a Set<user_id>. For 100M users, that's 6+ GB of RAM. Do it per-URL or per-minute and you're out of memory fast. HyperLogLog answers the same question with 12 KB, 99% accuracy. Count-Min Sketch estimates "how often was element X seen?" in similar space. Top-K finds the most frequent items without tracking everyone.

These approximate data structures are the only way to do cardinality, frequency, and ranking analytics at scale. Used by Google, Cloudflare, Redis, Postgres, every ad-tech platform.

02

HyperLogLog — intuition

Key insight: if you hash items to a uniform random bit string, the maximum number of leading zeros you've seen gives a probabilistic estimate of how many unique items you've hashed. Seen a hash with 10 leading zeros? You've probably seen ~2¹⁰ = 1024 distinct items (since the probability of a random hash starting with 10 zeros is 1/1024).

Using just one counter is noisy, so HLL splits items into buckets (via the first few hash bits) and tracks max-leading-zeros per bucket. With m=16384 buckets, you get ~0.81% standard error using ~12 KB. Extraordinary accuracy/memory tradeoff.

Unions are free: HLL(A ∪ B) = element-wise max of their bucket arrays. Lets you aggregate cardinality across dimensions (per-region, per-hour) cheaply.

03

The approximate-structure zoo

Structure	Answers	Memory	Error
Bloom filter	"Is X in set?"	~10 bits per item	False positives only
HyperLogLog	"How many unique items?"	~12 KB constant	~0.8% standard error
Count-Min Sketch	"How many times was X seen?"	~width × depth × 4 bytes	Overestimates bounded by ε·N
t-Digest	"What's the P99 latency?"	~1 KB	Sub-1% on extreme percentiles
MinHash	"Jaccard similarity between sets?"	~k × 4 bytes	O(1/√k) std error
Top-K (Space-Saving)	"Top N most frequent items?"	~k × (size + counter)	Exact for items above threshold

04

Count-Min Sketch — the frequency counter

Maintain a 2D array of counters, width w × depth d. For each item X:

Hash X with d different hash functions → d bucket indices.
Increment the counter at each (row, bucket).

To query frequency of X: hash with same d functions, read d counters, return the minimum. The minimum is the tightest upper bound — the other rows may have been inflated by collisions.

Uses: rate limiting (per-IP request counts), heavy-hitter detection (spot the IP doing 10M RPS), ad-tech (approximate frequency capping).

HyperLogLog cardinality in 20 lines

import math, mmh3

class HLL:
    def __init__(self, p=14):  # p=14 → 16 KB, ~0.8% error
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = mmh3.hash64(item)[0] & ((1 << 64) - 1)
        idx = h >> (64 - self.p)
        w = (h << self.p) & ((1 << 64) - 1) | (1 << (self.p - 1))
        self.registers[idx] = max(self.registers[idx],
                                  64 - w.bit_length() + 1)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return int(alpha * self.m * self.m / z)

# 16 KB counts cardinalities up to billions with ~0.8% error

05

Deep dive — HLL at Cloudflare scale

Cloudflare publishes daily "unique visitors per zone" stats for every domain. Billions of events per day, millions of domains. Full-fidelity counting = impossible. Their solution:

Each edge PoP maintains a HLL per (zone_id, time_bucket) in memory.
Every hour, flush to a central store that UNIONs all edge HLLs → global cardinality per zone per hour.
Users query "unique visitors today" → union 24 hourly HLLs, return estimate.

Storage: one HLL per zone per hour = 12 KB. For 10M zones × 24 hours = 2.8 TB/day. The same data as exact counts would require petabytes. Query latency: microseconds for any aggregation window.

Interview use: "For the count-active-users problem we'll use HyperLogLog. Each server maintains a HLL sketch for the current window; we union across the fleet every minute for the global count. Space is constant regardless of user count. 0.8% error is well within product tolerance."

06

Real-world

Redis

PFADD / PFCOUNT / PFMERGE

Native HLL data type. 12 KB per key. Union multiple keys for cross-dimension aggregates.

Postgres + hll extension

Cardinality in SQL

Precompute HLLs per partition; queries union them at query time. Citus uses this heavily.

Google BigQuery / Snowflake

APPROX_COUNT_DISTINCT

Both default to HLL-based approximation for count-distinct at warehouse scale.

Druid / ClickHouse

HLL + t-Digest native

OLAP engines bake in approximate sketches for fast analytical queries over billions of rows.

07

Used in problems

Count active users is the canonical HLL problem. Leaderboard uses Top-K sketches for high-cardinality ranking. Recommendation algorithm uses MinHash for item-similarity at scale.

📺