Concept · Operations

Zero-Downtime Database Migration

01

Why this matters

You need to rename a column. Drop a table. Migrate from MySQL to Postgres. Add a NOT NULL field to a 1-billion-row table. Each of these can be a 30-minute outage if you do it naively — or zero downtime if you follow the right pattern. The difference is one principle: change the data layer in steps where old and new code coexist safely. Almost every "migration outage" story is a violation of this principle.

02

The expand-then-contract pattern

The single most important migration pattern. To rename column user_name to display_name:

  1. Expand: add the new column. Old code untouched.
  2. Backfill: copy user_namedisplay_name for existing rows. Run as a background job. Throttled.
  3. Dual-write: deploy code that writes to both columns on every change. Old + new always in sync.
  4. Switch reads: deploy code that reads from display_name. Old column still being written but unused.
  5. Stop dual-write: deploy code that only writes display_name.
  6. Contract: drop user_name column.

Each step is reversible. Each step is independently deployable. Each step works with the previous step's code. Zero downtime.

Cost: 6 deploys instead of 1, weeks of elapsed time. But every step is safe and traffic never sees inconsistency.

03

Common migration scenarios

MigrationPattern
Rename columnExpand-contract as above
Add NOT NULL columnAdd as nullable + default → backfill → enforce NOT NULL. NEVER add NOT NULL directly to a populated table.
Drop a columnStop reading first; deploy. Then stop writing; deploy. Then drop.
Change column typeAdd new column with new type → dual-write → backfill → switch reads → drop old
Split tableAdd new table → dual-write → backfill → switch reads → stop writes to old → drop old
Add index on huge tableUse CREATE INDEX CONCURRENTLY (Postgres) — non-blocking. Skip the lock.
Cross-DB migration (MySQL → Postgres)Same expand-contract but with CDC doing the dual-write
04

Deep dive — cross-database migration with CDC

You're moving from MySQL to Postgres. Different schemas, different connection strings, different SQL dialects. Naive: schedule a maintenance window, dump-restore, point app at new DB. Hours of downtime; data loss if anything fails.

The right way using CDC:

  1. Set up Postgres with target schema. Empty.
  2. Backfill: streaming export from MySQL → import to Postgres. Take hours/days; no downtime since MySQL still serves prod.
  3. Start CDC: Debezium tails MySQL binlog → applies changes to Postgres. New writes flow through. Postgres now mirrors MySQL with seconds of lag.
  4. Verify: run consistency checks. Row counts match? Sample rows match? Latency under control?
  5. Dual-write: deploy app version that writes to both MySQL and Postgres explicitly. Catches bugs the CDC layer might miss.
  6. Switch reads: deploy app reading from Postgres. MySQL still being written but unused.
  7. Stop writing to MySQL: deploy app writing only to Postgres.
  8. Decommission MySQL: archive, then delete.

Each step takes hours/days. The whole migration takes weeks. But traffic is served the entire time. This is how Stripe migrated billions of rows from MongoDB to Postgres; how Shopify continuously evolves its sharded MySQL fleet.

Interview answer

"Zero-downtime migrations follow expand-then-contract: add new schema, dual-write, backfill, switch reads, stop dual-write, drop old. Each step is independently deployable and reversible. Cross-database moves use CDC to keep target in sync during the dual-write window — Debezium tails the source's WAL/binlog and replays to the destination."

05

Common pitfalls

  • Adding NOT NULL directly on a populated table → table lock for hours on big tables → outage. Always: add nullable, backfill, then enforce.
  • Implicit type changes (VARCHAR(50) → VARCHAR(100) is fine; INT → BIGINT can rewrite the table). Test on a clone.
  • Adding an index without CONCURRENTLY → write lock. Postgres has CREATE INDEX CONCURRENTLY; MySQL has online DDL.
  • Foreign-key changes often re-validate the entire child table. Schedule during low traffic.
  • Skipping the dual-write step for "speed" → silent data inconsistency between writes that happened before and after the cut-over.
  • Forgetting to feature-flag the read switch → can't easily roll back to reading from old column when new column has bugs.
06

Real-world

GitHub

gh-ost (Online Schema Change)

MySQL schema migrator that creates a shadow table, dual-writes via triggers + CDC, swaps atomically. Used at GitHub for every schema change in their MySQL fleet.

PlanetScale

Branching for schema changes

Treats DB schema changes like git branches. Test in a branch; merge to main with verified migration. No downtime.

Stripe migrations

Billion-row online migrations

Heavy use of dual-write + CDC + feature flags. Public engineering blog documents the pattern.

AWS DMS

Cross-DB cloud-managed

Database Migration Service. Handles cross-engine migrations with continuous replication. Common for Oracle → Aurora migrations.

07

Used in problems

E-commerce platforms migrate constantly — schema changes, sharding splits, DB engine upgrades. News feed uses CDC for table-format changes. Payment gateway uses extreme caution with all schema changes — rollback paths must be explicit.

Next up