Change Data Capture (CDC)

01

Why this matters

Your transactional DB is the source of truth for orders. You also need a search index in Elasticsearch, a denormalized read store in Redis, an analytics warehouse, and a notifications pipeline triggered on order events. That is four downstream systems, all of which need to react to DB changes.

The naive approach: have your app write to all four after each DB write. There is no atomicity across them: if the search-index write fails after the DB commit, the systems are inconsistent. Your app code is also now coupled to every downstream.

Change Data Capture tails the database's own transaction log (the WAL, in Postgres) and emits each change as a Kafka event. Every downstream consumes from Kafka independently. The DB stays the single source of truth, each downstream syncs without app involvement, and your app code stays small.

02

How CDC works

Every relational DB writes a transaction log: Postgres WAL, MySQL binlog, Oracle redo log. The log records every row insert, update, and delete, and CDC tools read those changes back in commit order.

CDC tools (Debezium most famously) open a streaming connection to this log. They translate each entry into a structured event: {table: "orders", op: "insert", before: null, after: {...}, ts: ...}. Events publish to Kafka.
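As a rough illustration of the translation step, the sketch below turns a decoded log entry into an event of the simplified shape shown above. `LogEntry` and `to_change_event` are hypothetical names for this example, not Debezium's API, and Debezium's real envelope carries more metadata than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    """Hypothetical decoded log record; real log formats are DB-specific."""
    table: str
    kind: str                      # "insert", "update", or "delete"
    old_row: Optional[dict]
    new_row: Optional[dict]
    commit_ts_ms: int


def to_change_event(entry: LogEntry) -> dict:
    """Build an event in the simplified shape used in this section."""
    if entry.kind not in ("insert", "update", "delete"):
        raise ValueError(f"unknown operation: {entry.kind}")
    return {
        "table": entry.table,
        "op": entry.kind,
        # Inserts have no prior row image; deletes have no new one.
        "before": None if entry.kind == "insert" else entry.old_row,
        "after": None if entry.kind == "delete" else entry.new_row,
        "ts": entry.commit_ts_ms,
    }


entry = LogEntry("orders", "insert", None, {"id": 1, "status": "new"}, 1700000000000)
event = to_change_event(entry)
```

Carrying both `before` and `after` lets a consumer tell what changed without querying the source DB.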

Consumers do whatever they want. Search-indexer subscribes to orders topic and updates Elasticsearch. Analytics-pipeline subscribes and writes to the warehouse. Cache-invalidator subscribes and DELs Redis keys. None of them touch the source DB; none of them need app-code support.
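A minimal sketch of two such consumers, assuming events of the shape above and rows keyed by an `id` column. Plain dicts stand in for Elasticsearch and Redis, and the function names are made up for illustration.

```python
def apply_to_search_index(index: dict, event: dict) -> None:
    """Keep a toy search index (id -> document) in step with the table."""
    if event["op"] == "delete":
        index.pop(event["before"]["id"], None)
    else:  # insert or update: upsert the latest row image
        index[event["after"]["id"]] = event["after"]


def invalidate_cache(cache: dict, event: dict) -> None:
    """Drop the cached entry for the changed row, whatever the operation."""
    row = event["after"] or event["before"]
    cache.pop(f"order:{row['id']}", None)


events = [
    {"table": "orders", "op": "insert", "before": None,
     "after": {"id": 1, "status": "new"}, "ts": 1},
    {"table": "orders", "op": "update", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}, "ts": 2},
    {"table": "orders", "op": "delete", "before": {"id": 1, "status": "paid"},
     "after": None, "ts": 3},
]

search_index: dict = {}
cache = {"order:1": {"id": 1, "status": "new"}}

# Each consumer reads the same stream at its own pace.
for e in events[:2]:
    apply_to_search_index(search_index, e)
for e in events[:1]:
    invalidate_cache(cache, e)
```

Neither function knows about the other or about the application that made the write, which is the decoupling the paragraph above describes.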

03

CDC vs application-emitted events

App emits events

"After commit, publish to Kafka"

App is responsible. Easy to get right for the happy path. Atomicity is broken: the commit can succeed and the publish fail → silent inconsistency. Mitigate with the outbox pattern.

CDC tails the log

WAL is the source of events

If it's in the DB, it's in Kafka. Atomicity guaranteed by the DB itself — no separate publish step to fail. Apps don't change. Adding a new downstream = new consumer, no app deploy.

Outbox and CDC combine well: the app writes to an "outbox" table in the same transaction as its business data, Debezium captures changes to the outbox table, and downstreams consume them. You get app-defined events with CDC's atomicity guarantee.
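The write side of the outbox pattern can be sketched with SQLite from the standard library. The table layout, the `OrderPlaced` event name, and the `place_order` function are assumptions for illustration. In production the outbox would live in the CDC-enabled source DB.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)


def place_order(conn: sqlite3.Connection, order_id: int, fail: bool = False) -> None:
    """Write the order row and its outbox event in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'new')", (order_id,))
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


place_order(conn, 1)
try:
    place_order(conn, 2, fail=True)
except RuntimeError:
    pass  # neither the order nor the event was persisted

orders = conn.execute("SELECT id FROM orders").fetchall()
outbox = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
```

Because both inserts share one transaction, an order can never exist without its event, and the failed call leaves no trace in either table.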

Event-sourced aggregate rebuild
class BankAccount:
    """State is derived by replaying events; events are source of truth."""
    def __init__(self):
        self.balance = 0
        self.uncommitted = []

    def apply(self, event):
        if event["type"] == "deposit":
            self.balance += event["amount"]
        elif event["type"] == "withdraw":
            self.balance -= event["amount"]

    @classmethod
    def from_history(cls, events):
        account = cls()
        for e in events:
            account.apply(e)
        return account

    def deposit(self, amount):
        event = {"type": "deposit", "amount": amount}
        self.apply(event)           # apply to in-memory state
        self.uncommitted.append(event)

# Store events in append-only log. Rebuild state on load.
# Snapshots every N events keep replay time bounded.

04

Practical considerations

  • Schema evolution: when the source table changes shape, every downstream's parser must handle both the old and the new event shape. A schema registry (Confluent's, Apicurio) helps.
  • Snapshots: a new consumer that wants all existing data, not just changes from now on, needs an initial snapshot. Debezium handles this: it dumps the current rows, then starts streaming.
  • Replication lag: CDC tails the log asynchronously, so downstreams trail the DB, typically by seconds. Fine for analytics; problematic if a user writes and immediately reads from the search index.
  • WAL retention: Postgres recycles WAL segments once nothing needs them, but a replication slot pins WAL until its consumer confirms it. If the CDC consumer falls hours behind, retained WAL can fill the disk. Monitor lag aggressively.
  • Throughput cost: every write becomes an event. Bursty workloads can flood Kafka. Plan partition counts accordingly.
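For the lag monitoring mentioned in the list above, one possible sketch computes how many bytes of WAL a replication slot is holding back. It assumes the two LSN strings were already fetched from Postgres (for example from `pg_current_wal_lsn()` and the `confirmed_flush_lsn` column of `pg_replication_slots`). The function names and the 1 GiB threshold are arbitrary choices for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    The text form is two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the slot's consumer has not yet confirmed."""
    return lsn_to_int(current_lsn) - lsn_to_int(confirmed_flush_lsn)


def lag_alert(lag_bytes: int, threshold_bytes: int = 1 << 30) -> bool:
    """True when lag exceeds the (assumed) 1 GiB alert threshold."""
    return lag_bytes > threshold_bytes


lag = retained_wal_bytes("1/00000000", "0/C0000000")
```

Alerting on bytes retained rather than on wall-clock delay ties the alarm directly to the disk-fill risk.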

05

Deep dive — Debezium architecture

Debezium is the dominant open-source CDC tool. Architecture:

  1. Connector: DB-specific. PostgresConnector, MySqlConnector, MongoDbConnector, etc. Each knows how to subscribe to its DB's log.
  2. Kafka Connect runtime: manages the connector, handles offset tracking ("we've consumed up to WAL position 12345"), and restarts on failure.
  3. Source DB: must be configured. Postgres needs wal_level=logical; MySQL needs a row-based binlog. Postgres also needs a replication slot so WAL isn't removed before Debezium reads it.
  4. Kafka topics: one per source table by convention. orders table → dbserver.public.orders topic.
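To make the pieces above concrete, here is a sketch of the JSON payload one might register with Kafka Connect for a Postgres source. The property names follow Debezium 2.x as far as I know (older releases used `database.server.name` where `topic.prefix` appears). The hostname, credentials, database, and slot name are placeholders, and both helper functions are hypothetical.

```python
import json


def postgres_connector_payload(name: str, prefix: str, tables: list[str]) -> dict:
    """Build a Kafka Connect registration payload for a Debezium Postgres connector.

    Hostname, credentials, and database name below are placeholders.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "db.example.internal",   # placeholder
            "database.port": "5432",
            "database.user": "debezium",                  # placeholder
            "database.password": "change-me",             # placeholder
            "database.dbname": "shop",                    # placeholder
            "topic.prefix": prefix,
            "table.include.list": ",".join(tables),
            "slot.name": "debezium_orders",               # placeholder
        },
    }


def topic_for(prefix: str, schema: str, table: str) -> str:
    """Default topic naming: <prefix>.<schema>.<table>."""
    return f"{prefix}.{schema}.{table}"


payload = postgres_connector_payload("orders-cdc", "dbserver", ["public.orders"])
body = json.dumps(payload, indent=2)  # what would be POSTed to Kafka Connect's /connectors
```

With the prefix `dbserver`, `topic_for("dbserver", "public", "orders")` gives the topic name used in item 4.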

Failure model: if Debezium dies, it resumes from the last-committed offset on restart. At-least-once delivery to Kafka — consumers must be idempotent (use the change's primary key + LSN as dedupe key).
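One way to get that idempotency is to remember the highest LSN applied per primary key and skip anything at or below it. The sketch assumes events for a given key arrive in LSN order (for example, a topic partitioned by key). `pk` and `lsn` are hypothetical flattened fields, and `IdempotentApplier` is an illustrative name.

```python
class IdempotentApplier:
    """Applies change events at most once per (primary key, LSN)."""

    def __init__(self):
        self.rows = {}          # primary key -> latest row image
        self.last_lsn = {}      # primary key -> highest LSN applied

    def handle(self, event: dict) -> bool:
        """Apply the event; return False if it was a duplicate or stale."""
        pk, lsn = event["pk"], event["lsn"]
        if lsn <= self.last_lsn.get(pk, -1):
            return False        # redelivery after a restart: skip
        if event["op"] == "delete":
            self.rows.pop(pk, None)
        else:
            self.rows[pk] = event["after"]
        self.last_lsn[pk] = lsn
        return True


applier = IdempotentApplier()
stream = [
    {"pk": 7, "lsn": 100, "op": "insert", "after": {"id": 7, "status": "new"}},
    {"pk": 7, "lsn": 120, "op": "update", "after": {"id": 7, "status": "paid"}},
    # Redelivered after a connector restart:
    {"pk": 7, "lsn": 120, "op": "update", "after": {"id": 7, "status": "paid"}},
]
results = [applier.handle(e) for e in stream]
```

A real consumer would persist `last_lsn` alongside the data it writes, so the dedupe state survives its own restarts.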

Interview answer

"For event-driven downstream sync, we use Debezium to tail the Postgres WAL into Kafka topics. Consumers (search indexer, cache invalidator, analytics) react to changes without app-side instrumentation. Atomicity is guaranteed by the DB — we never have a 'commit succeeded but event lost' state."

06

Real-world

Debezium

Open-source CDC

Postgres, MySQL, Mongo, Oracle, SQL Server, Cassandra. The dominant open-source choice.

AWS DMS

Managed CDC + migration

Stream changes between heterogeneous DBs (e.g. Oracle → Aurora). Handles schema mapping.

Striim / Fivetran

Commercial CDC platforms

SaaS. Connects 100+ source/sink combinations. Heavily used in analytics ETL.

Postgres logical replication

Built-in CDC primitives

Can publish changes natively; many tools build on the same primitives (Debezium, pglogical, etc.).

07

Used in problems

E-commerce uses CDC to sync orders to search/analytics/notification systems. News feed CDCs the posts table to feed-fanout workers. Payment gateway CDCs the ledger to webhook delivery. Distributed logging IS a CDC pattern at scale.
