Concept · Databases

Database Types

01

Why this matters

"We'll use a database" is not an architecture decision — it's punting. Every system design interview eventually asks: Postgres or Cassandra? Redis or DynamoDB? Mongo or Elasticsearch? The answer is never "it depends" alone — it's "this workload needs X because Y."

This page is the decision matrix. Learn the shape of each, the failure modes of each, and which real workloads fit each.

02

The seven families

FamilyExemplarsBest forHard no
Relational (SQL)Postgres, MySQL, CockroachDBTransactions, complex queries, strong consistencyHuge write throughput on one table without sharding
Key-valueRedis, Memcached, DynamoDBCaches, sessions, hot lookups by single keyAny query that's not by key
DocumentMongoDB, Couchbase, DynamoDBFlexible schema, nested objects, per-object CRUDHeavy relational queries, multi-doc ACID
Wide-columnCassandra, HBase, ScyllaDBTime-series, massive writes, known access patternsAd-hoc queries, joins, strong consistency
GraphNeo4j, JanusGraph, Amazon NeptuneRelationship-heavy data (social graph, fraud detection)Simple CRUD, huge scale
Search (inverted index)Elasticsearch, OpenSearch, SolrFull-text search, faceting, relevance rankingSource-of-truth storage (use as secondary only)
Time-seriesInfluxDB, TimescaleDB, PrometheusMetrics, monitoring, IoT telemetryTransactional workloads, low cardinality
03

The ten databases you must know

PostgreSQL

Relational — the default

ACID, mature, JSON support, full-text search built in, extensions for everything. Scale limit ≈ single-box writes (~10k–50k/sec). Shard or switch when you outgrow.

MySQL

Relational — operational simplicity

Simpler than Postgres, slightly less featureful. Huge operator knowledge. Vitess shards it at YouTube scale.

Redis

In-memory KV + data structures

0.5ms reads, 1M ops/sec per node. Sessions, caches, leaderboards (sorted sets), pub/sub, rate limiting. Data in RAM — expensive for huge datasets.

Memcached

Pure KV cache

No persistence, no data structures, no replication. Just a cache. Sub-millisecond. Still relevant when you need simplicity.

DynamoDB

Managed KV/document, any scale

Single-digit ms at any throughput. Primary/sort keys drive data model. Expensive at large scale but zero ops.

Cassandra

Wide-column, write-heavy

Leaderless, linearly scalable writes. Discord, Netflix, Instagram. Eventual consistency by default. Hard to operate well.

MongoDB

Document, flexible schema

BSON documents, rich query language, secondary indexes. Sharded clusters. ACID on single doc; multi-doc since 4.0 (slower).

Elasticsearch

Inverted index, search

Full-text + structured filtering + aggregations. NOT your source of truth — use alongside Postgres.

Neo4j

Graph — relationship-first

Traversals over millions of edges in milliseconds. Fraud detection, recommendation, social networks. Not a general-purpose DB.

Kafka

Distributed log

Technically not a DB — but often used as the system of record. Event sourcing, audit logs, stream processing. Retention from hours to years.

04

The decision tree

  1. Do you need ACID transactions across multiple rows? → Postgres/MySQL. Nothing else does this well.
  2. Do you need single-key lookups in < 1ms? → Redis (in-memory) or DynamoDB (managed).
  3. Do you need > 100k writes/sec on one logical dataset? → Cassandra or DynamoDB. Avoid SQL sharding unless the team has scar tissue.
  4. Do you need full-text search with relevance scoring? → Elasticsearch. Alongside, never instead of, a source-of-truth DB.
  5. Is your data graph-shaped (traversals > joins)? → Neo4j or a graph extension.
  6. Is your data timestamped metrics? → Prometheus (metrics) or TimescaleDB (analytics).
  7. Default for greenfield apps?Postgres. Can handle most workloads up to mid-scale and the migration path is well-trodden.
LSM-tree sketch (memtable + SSTables)
# LSM = Log-Structured Merge tree. Writes: append to memtable + WAL.
# Reads: check memtable → bloom filters → SSTables from newest to oldest.

class LSM:
    def __init__(self, memtable_limit=10**6):
        self.memtable = {}  # sorted map (in-mem B-tree in real engines)
        self.wal = WAL("lsm.wal")
        self.sstables = []  # list of on-disk immutable files, newest first

    def put(self, k, v):
        self.wal.append((k, v))
        self.memtable[k] = v
        if len(self.memtable) >= 10**6:
            self._flush()

    def _flush(self):
        sst = SSTable.from_dict(self.memtable)
        self.sstables.insert(0, sst)
        self.memtable = {}
        self.wal.rotate()

    def get(self, k):
        if k in self.memtable: return self.memtable[k]
        for sst in self.sstables:
            if k in sst.bloom:  # cheap probabilistic check
                v = sst.get(k)
                if v is not None: return v
        return None

# Compaction merges older SSTables; Cassandra, RocksDB, LevelDB all do this.
05

Deep dive — polyglot persistence, the reality

At scale, you don't pick one DB. You pick many. A realistic stack for a consumer social app:

  • Postgres — users, payments, account settings (relational + ACID)
  • Cassandra — feed posts, activity events (write volume + eventual OK)
  • Redis — session tokens, feed caches, rate limiters (sub-ms reads)
  • Elasticsearch — search (posts, users, hashtags)
  • S3 — media (images, video) with CDN in front
  • Kafka — event bus between all of the above

Each has one job it does 10× better than the alternatives. Trying to force Postgres to do search, or Elasticsearch to hold the ledger, loses ~90% of the specialized DB's value.

Operational reality

Running 6 databases means 6 on-call rotations, 6 failure modes, 6 backup strategies. Polyglot isn't free. Most companies pick 2–3 (Postgres + Redis + one of {Cassandra, Elasticsearch, Kafka}) and push each hard.

06

Used in problems

Every problem in this portfolio picks 2–5 DBs. The decision table in §4 is the framework; the specific choices are justified on each problem page.

Next up