You have 1 billion URLs you've already crawled. Before crawling a new URL, you want to check "have I seen this before?" A hash set costs ~100 GB of RAM for 1B URLs. Most queries will be for URLs you've never seen. You just want to skip the expensive DB lookup when you already know the answer.
A Bloom filter answers "is this item in the set?" using a few megabytes instead of gigabytes. Catch: small false-positive rate (says yes when answer is no). Zero false negatives — if it says "no," it's definitely no.
Perfect for "skip the DB read if we already know we won't find it." Used by Cassandra, HBase, Chrome's malware checker, web crawlers, and every serious caching system.
02
How it works
A Bloom filter is a bit array of m bits plus k hash functions.
Add an item: hash it k times, set those k bits to 1.
Check an item: hash it k times, check those k bits. If all are 1 → probably in the set. If any is 0 → definitely not in the set.
A false positive happens when the k bits for a never-seen item happen to all be set by other items' hashes. Tunable — more bits per item + more hash functions = lower false-positive rate, at the cost of memory and CPU.
Bloom Filter OperationsSVG
03
Sizing — the math that matters
For n items and target false-positive rate p:
Bits needed: m = −n × ln(p) / (ln 2)²
Optimal hash functions: k = (m/n) × ln 2
In practical terms:
Bloom filter in ~25 lines
import mmh3
from bitarray import bitarray
class BloomFilter:
def __init__(self, size=10_000_000, k=7):
self.size = size
self.k = k
self.bits = bitarray(size)
self.bits.setall(0)
def _hashes(self, item):
h1 = mmh3.hash(item, 0) % self.size
h2 = mmh3.hash(item, 1) % self.size
return [(h1 + i * h2) % self.size for i in range(self.k)]
def add(self, item):
for h in self._hashes(item):
self.bits[h] = 1
def contains(self, item):
return all(self.bits[h] for h in self._hashes(item))
# 10M bits + 7 hashes → ~1% false-positive at ~1M inserts; 1.2 MB total
Items (n)
False-positive rate
Size
Hashes (k)
1M
1%
1.2 MB
7
1M
0.1%
1.8 MB
10
1B
1%
~1.2 GB
7
1B
0.01%
~2.4 GB
14
Compare: storing 1B 64-byte URLs in a hash set = 64 GB minimum, probably 100+ GB with overhead. The Bloom filter uses ~2% of that.
04
What Bloom filters can't do
Remove items. Unsetting a bit might clear part of another item's fingerprint. Use a counting Bloom filter (each cell is a counter) if deletions matter.
Enumerate items. You can't get back the list you put in. Bloom filters are lossy.
Guarantee positives. You still need the backing store when the filter says "yes." The filter only saves you when it says "no."
Rule of thumb: a Bloom filter is a negative cache. It eliminates reads you'd have made to the real store for items that definitely aren't there.
05
Deep dive — when it shines
LSM-tree lookups (Cassandra, RocksDB, HBase). Each level has Bloom filters. On lookup, check the filter for a level; if "no," skip. Avoids reading millions of SSTables per query. Classic use case.
Web crawler. 100B URLs. "Have I seen this URL?" Bloom filter says no 99% of the time → skip DB entirely. The 1% false positives cost an occasional unnecessary DB check, but system-wide saves 99× the reads.
Chrome's malware checker. Local Bloom filter of known bad URLs. Most URLs the user visits don't match → zero network call. Hits require a network check against Google's real database.
CDNs and deduplication. "Have I seen this file already?" Before computing hashes or storing the file, Bloom filter check on chunk fingerprint. Saves massive I/O in backup systems.
Interview answer
"Use a Bloom filter to short-circuit lookups for items that aren't in our set. Size it for the target false-positive rate; 1% is usually plenty because the backing store confirms positives anyway. Memory savings are 20–50× vs a hash set."
06
Real-world
Cassandra / RocksDB
Per-SSTable Bloom filter
Each SSTable has its own filter. Lookups skip SSTables that definitely don't contain the key. Essential to LSM-tree performance.
Redis
RedisBloom module
Bloom filter as a Redis data type. BF.ADD, BF.EXISTS. Scalable filters grow capacity dynamically.
Bitcoin SPV wallets
Privacy-preserving queries
Clients use Bloom filters to request only transactions "probably involving me" from full nodes, without revealing exactly which addresses.
Chrome Safe Browsing
Local prefix filter
Hash prefixes of known malicious URLs. Client checks locally first; only falls back to Google's server on match.
07
Used in problems
Web crawler uses Bloom filters for URL deduplication at scale. URL shortener can use them to reject obviously-invalid codes before hitting the DB. Mutual connections uses them to reject non-friend lookups.