Post-mortem · Deploy / operator error

Accidentally Deleted

An engineer, tired and debugging a replication issue at 11pm, ran rm -rf on the production primary database directory instead of the replica. Five of six backup methods were later discovered to have been silently failing for months. 6 hours of user data were lost.

Operator error · Backups failing silently · Data loss · Live-streamed recovery
01

TL;DR

A spam attack slowed GitLab.com and caused replication lag. An engineer debugging replication ran rm -rf on what they thought was the replica's data directory, but it was the primary's. On realizing the mistake, they tried to restore from backups. Five of six backup systems had been silently broken for months, and no one knew. The one working backup was 6 hours old. GitLab restored from it and, to their credit, live-streamed the recovery publicly. Lost: ~5,000 projects, ~5,000 comments, ~700 new users.

02

Timeline

  • Evening Jan 31 — GitLab.com under spam attack causing replication lag. Engineer decides to manually reset the replica and re-sync from primary.
  • 23:00 UTC — Engineer, on production console for primary DB (db1.cluster.gitlab.com) believing they're on the replica, runs rm -rf /var/opt/gitlab/postgresql/data.
  • 23:00:01 UTC — Realizes mistake ~1 second later. Only ~4.5 GB of ~300 GB remaining. Stops the rm.
  • 23:00–00:00 UTC — Restore attempts fail one after another. LVM snapshots: not in use. pg_dump cron: empty files from a months-old bug. S3 snapshots: bucket empty. Azure disk snapshots: never enabled.
  • ~00:30 UTC — Discover a 6-hour-old snapshot of the staging database that was made by a lucky coincidence. This is the only usable restore.
  • 00:30 UTC Feb 1 → 18:00 UTC — GitLab publicly live-streams the recovery on YouTube. Transparent throughout.
  • ~18:00 UTC Feb 1 — Service restored from 6-hour-old snapshot. ~18 hours total outage.
03

Root cause

The trigger was human error — wrong terminal. The real root causes were the silently failing backups:

  1. pg_dump produced zero-byte files because of a Postgres version mismatch no one noticed.
  2. LVM snapshots were documented but never actually set up in production.
  3. S3 snapshots were enabled but the bucket was configured with a lifecycle rule that expired them daily.
  4. Azure disk snapshots weren't turned on for the DB disks.
  5. Replication was the fifth "backup" — but replication copies deletes, so it's not a backup.
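
The zero-byte pg_dump files in item 1 are the kind of failure a cheap sanity check on backup artifacts could flag. What follows is a minimal sketch, assuming dumps land as ordinary files on disk. The function name check_backup_file and the thresholds are hypothetical, not GitLab's actual tooling. Passing this check only shows that a file exists and looks plausible; it does not show that the backup can be restored.

```python
import os
import time


def check_backup_file(path, min_bytes=1024, max_age_hours=24, now=None):
    """Return a list of problems found with one backup artifact.

    An empty list means the file passed these shallow checks only.
    The default thresholds are illustrative assumptions.
    """
    now = time.time() if now is None else now
    if not os.path.isfile(path):
        return [f"{path}: missing"]

    problems = []
    stat = os.stat(path)
    # A zero-byte or tiny dump is almost certainly a failed dump.
    if stat.st_size < min_bytes:
        problems.append(f"{path}: only {stat.st_size} bytes")
    # A dump that has not been refreshed suggests the job stopped running.
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: {age_hours:.1f} hours old")
    return problems
```

Wired into alerting, a non-empty result would page someone on the first failed run rather than months later.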

Nobody had tested a restore from any of these in months. If any one engineer had tried, the broken backups would have surfaced — but no one did.
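
One way to make an untested restore visible is to track, per backup system, when a test restore last succeeded and to flag anything too old. The sketch below assumes a simple mapping from system name to timestamp. The name stale_restores, the 7-day threshold, and the sample data are hypothetical illustrations, not figures from the incident.

```python
from datetime import datetime, timedelta, timezone


def stale_restores(last_success, now, max_age=timedelta(days=7)):
    """Return {system: age} for systems whose last verified restore is too old.

    last_success maps a backup system name to the datetime of its last
    successful test restore, or None if it has never been restored.
    A system that was never restored is reported with an age of None.
    """
    stale = {}
    for system, when in last_success.items():
        if when is None:
            stale[system] = None
        elif now - when > max_age:
            stale[system] = now - when
    return stale


# Illustrative data only: two systems that would need attention, one that would not.
example_now = datetime(2024, 1, 10, tzinfo=timezone.utc)
example = stale_restores(
    {
        "pg_dump": None,
        "lvm_snapshot": example_now - timedelta(days=30),
        "staging_copy": example_now - timedelta(hours=6),
    },
    example_now,
)
```

Here example would contain pg_dump and lvm_snapshot but not staging_copy. The report is only as good as the restore job feeding it, so the timestamps should come from an actual restore-and-verify run rather than from the backup job itself.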

04

Blast radius

~18 hours of downtime. 6 hours of user data permanently lost: ~5,000 projects created or updated, ~5,000 comments, ~700 new users. No compensation for the data loss was possible beyond an apology and process changes. The transparency of the live-streamed recovery softened the reputational impact, and many in the community appreciated the openness.

05

Lessons

  1. Backups that aren't tested are not backups. Automated periodic restore-and-verify is the only way to know. A dashboard of "time since last successful restore" per backup system is a must.
  2. Replication is not a backup. It propagates deletes and corruption. Point-in-time recovery from WAL or explicit snapshots is what catches "oops I deleted it."
  3. Dangerous commands need scripts with confirmations. Ask "If I were going to delete 300 GB from a prod primary, what safeguards would I want?" and let the answer shape the tooling. Tooling should surface "you are about to delete X TB of production data; type the hostname to confirm."
  4. Prominent terminal identification on production. Bright color on prod, bold prompt showing host + role. Human brains fail at 11pm.
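
The type-the-hostname confirmation from lesson 3 can be sketched as a small wrapper around any destructive operation. This is an illustration under assumptions, not a tool GitLab used: the names build_warning, is_confirmed and guarded_delete are hypothetical, and the delete function and the prompt are injected so the guard can be exercised without touching a real filesystem.

```python
import socket


def build_warning(path, size_gb, hostname):
    """Build the prompt shown before a destructive operation."""
    return (
        f"You are about to delete {size_gb:.0f} GB under {path} on {hostname}.\n"
        "Type the hostname to confirm: "
    )


def is_confirmed(typed, hostname):
    """Accept only an exact match of the hostname the command will run on."""
    return typed.strip() == hostname


def guarded_delete(path, size_gb, delete_fn, ask=input, hostname=None):
    """Run delete_fn(path) only after the operator types this machine's hostname.

    Returns True if the delete ran, False if the confirmation did not match.
    """
    hostname = hostname or socket.gethostname()
    typed = ask(build_warning(path, size_gb, hostname))
    if not is_confirmed(typed, hostname):
        return False
    delete_fn(path)
    return True
```

The guard works because the operator has to type where they believe they are. Someone who thinks they are on the replica types the replica's name, which does not match the primary's hostname, and the delete never runs.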
06

Concepts in play