Concept · Databases

Indexing

01

Why this matters

A database table with 100M rows. A query: WHERE email = 'x@y.com'. Without an index, the DB scans all 100M rows — seconds per query. With a B-tree index, it's ~10 lookups, 0.5ms total. That's a 10,000× difference and the reason indexes exist.

But indexes aren't free — every write has to update the index too. The art is picking what to index (only hot-path queries), what kind (B-tree, hash, LSM, inverted), and in what order for composite indexes.

02

The four index families

IndexReads wellWrites wellBest for
B-treePoint lookup + range scan in O(log N)Moderate — log N rebalancingGeneral-purpose. Postgres, MySQL, Oracle default.
LSM-treeSlightly slower than B-tree (merge on read)Excellent — sequential writesWrite-heavy. Cassandra, LevelDB, RocksDB.
HashO(1) point lookup; no rangeFastExact-match only, tiny latency budgets. Redis, memcached.
InvertedFull-text / term lookupsSlow — per-term posting list updatesSearch. Elasticsearch, Lucene.
03

B-tree in one breath

A balanced tree where each internal node holds ~100–500 keys. For 1 billion rows, the tree is only 3–4 levels deep. Lookup = 3–4 random disk reads + final leaf scan = ~0.5–1ms.

Range scans are the killer feature: find the start, then walk leaf-level siblings. "Give me all orders from May" is one seek + a sequential scan, not a million lookups.

Every index row includes the indexed column(s) + a pointer to the actual row. A "covering index" includes every column the query needs, so you never have to fetch the actual row.

Cursor-based pagination (stable under insert)
-- ❌ OFFSET / LIMIT breaks: new rows shift everyone's view
-- SELECT * FROM posts ORDER BY id DESC LIMIT 20 OFFSET 40;

-- ✓ Cursor (opaque token = last-seen id or (ts, id) tuple):
SELECT id, title, created_at
FROM posts
WHERE (created_at, id) < ($1, $2)     -- cursor from previous page
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Encode cursor as base64(created_at || id); send to client.
-- Stable under concurrent inserts. O(log N) per page via composite index.

CREATE INDEX posts_ts_id_desc ON posts(created_at DESC, id DESC);
04

LSM vs B-tree — the write-heavy tradeoff

B-tree

Read-optimized

Every write updates an existing tree node in place → random I/O. Well-suited when reads dominate writes (10:1 or more). Space-efficient (no duplicates).

LSM-tree

Write-optimized

Writes go to an in-memory buffer, flushed to immutable sorted files. Background compaction merges files. Sequential writes — SSDs love this. Cost: reads may check multiple files (bloom filters help).

Rule of thumb

B-tree if read:write is 10:1 or higher. LSM if writes dominate (logs, metrics, IoT, feeds). At 1:1 it's a wash; pick B-tree for operational familiarity.

05

Deep dive — composite index ordering matters

You have orders(user_id, status, created_at). You query mostly by user + status. Which column order?

Index on (user_id, status, created_at):

  • WHERE user_id = 5 AND status = 'open' — ✅ uses index
  • WHERE user_id = 5 alone — ✅ uses index (prefix)
  • WHERE status = 'open' alone — ❌ can't use index (status is not leading)
  • WHERE user_id = 5 ORDER BY created_at — ✅ uses index for sort (no extra sort step)

Rule: leading column is the most selective / most common filter. Trailing columns are for secondary filters and sort orders. Columns between them are the constraint — you can't skip.

Don't over-index. Each index costs ~20–30% write throughput. Tables with 15 indexes are write-bound for no benefit. Start with the 2–3 most-used queries; measure; add only what's proven needed.

06

Real-world

Postgres B-tree

Default for everything

Covers 95% of use cases. GIN indexes for full-text and JSON. GiST for geospatial. Partial indexes for "active users only."

RocksDB (embedded LSM)

Under many distributed systems

Used inside CockroachDB, TiDB, Kafka streams. Great write throughput. Complex tuning of compaction settings.

Elasticsearch inverted

Search-first

Term → doc_id posting lists. Scores documents by TF-IDF / BM25. Trade: 2–5× disk footprint vs source data.

Redis sorted sets

Skiplist — a sort of index

O(log N) insert/remove, O(log N) range queries by score. Leaderboards, time-series, priority queues.

07

Used in problems

Typeahead uses inverted indexes + prefix tries. E-commerce uses B-tree indexes on product IDs and category. Yelp uses geospatial B-tree variants. News feed uses LSM-style writes in Cassandra for the feed store.

Next up