Post-mortem · Deploy / operator error

Accidentally Deleted

An engineer, tired and debugging a replication issue at 11pm, ran rm -rf on the production primary's database directory instead of the replica's. Five of six backup methods were later discovered to have been failing silently for months. Six hours of user data were lost.

Operator error · Backups failing silently · Data loss · Live-streamed recovery

TL;DR

A spam attack made GitLab.com slow. An engineer debugging replication ran rm -rf on what they thought was the replica's data directory; it was the primary's. As soon as they realized, the team tried restoring from backups. Five of six backup systems had been silently broken for months, and no one knew. The one working backup was 6 hours old. GitLab restored from it and, to its credit, live-streamed the recovery publicly. Lost: ~5,000 projects, ~5,000 comments, ~700 new users.

Timeline

Evening Jan 31 — GitLab.com under spam attack causing replication lag. Engineer decides to manually reset the replica and re-sync from primary.
23:00 UTC — Engineer, on production console for primary DB (db1.cluster.gitlab.com) believing they're on the replica, runs rm -rf /var/opt/gitlab/postgresql/data.
23:00:01 UTC — Realizes mistake ~1 second later. Only ~4.5 GB of ~300 GB remaining. Stops the rm.
23:00–00:00 UTC — Attempted restore from LVM snapshots — not in use. From pg_dump cron — empty files from a months-old bug. From S3 snapshots — S3 bucket empty. From Azure disk snapshots — never enabled.
~00:30 UTC — Discover a 6-hour-old snapshot of the staging database that was made by a lucky coincidence. This is the only usable restore.
00:30 UTC Feb 1 → 18:00 UTC — GitLab publicly live-streams the recovery on YouTube. Transparent throughout.
~18:00 UTC Feb 1 — Service restored from 6-hour-old snapshot. ~18 hours total outage.

Root cause

The trigger was human error — wrong terminal. The real root causes were the silently failing backups:

pg_dump produced zero-byte files because of a Postgres version mismatch no one noticed.
LVM snapshots were documented but never actually set up in production.
S3 snapshots were enabled but the bucket was configured with a lifecycle rule that expired them daily.
Azure disk snapshots weren't turned on for the DB disks.
Replication was the fifth "backup" — but replication copies deletes, so it's not a backup.

Nobody had tested a restore from any of these in months. A single restore attempt by any engineer would have surfaced the broken backups, but no one made one.
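A cheap first line of defence against the zero-byte pg_dump files described above is an automated check on the backup artifacts themselves. The following is a minimal sketch, not what GitLab ran: the function name, the thresholds, and the idea of checking by file size and modification time are all assumptions for illustration. It only detects missing, empty, or stale files; it does not prove that a file can be restored.

```python
import os
import time
from dataclasses import dataclass


@dataclass
class BackupCheck:
    path: str
    ok: bool
    reason: str


def check_backup_file(path, max_age_seconds, min_size_bytes=1, now=None):
    """Flag a backup artifact that is missing, too small, or stale.

    Hypothetical helper: thresholds are illustrative, not recommendations.
    """
    now = time.time() if now is None else now
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return BackupCheck(path, False, "missing")
    # A zero-byte dump passes "the cron job ran" but fails here.
    if st.st_size < min_size_bytes:
        return BackupCheck(path, False, f"too small ({st.st_size} bytes)")
    age = now - st.st_mtime
    if age > max_age_seconds:
        return BackupCheck(path, False, f"stale ({age / 3600:.1f} h old)")
    return BackupCheck(path, True, "ok")
```

A check like this would typically run from a scheduler and page someone when ok is False. Passing it is necessary but not sufficient, so it complements a real restore test and does not replace one.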

Blast radius

~18 hours of downtime. 6 hours of user data permanently lost: ~5,000 projects created or updated, ~5,000 comments, ~700 new users. Beyond an apology and changes in process, no compensation for the lost data was possible. The transparency of the live-streamed recovery softened the reputational impact; many in the community appreciated the openness.

Lessons

Backups that aren't tested are not backups. Automated periodic restore-and-verify is the only way to know. A dashboard of "time since last successful restore" per backup system is a must.
Replication is not a backup. It propagates deletes and corruption. Point-in-time recovery from WAL or explicit snapshots is what catches "oops I deleted it."
Dangerous commands need scripts with confirmations. Ask "If I were going to delete 300 GB from a prod primary, what safeguards would I want?" and build the answer into the tooling. Tooling should surface "you are about to delete X TB of production data; type the hostname to confirm."
Prominent terminal identification on production. Bright color on prod, bold prompt showing host + role. Human brains fail at 11pm.
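The "type the hostname to confirm" safeguard can be sketched as a small wrapper that destructive scripts call before doing anything. This is an illustrative sketch under stated assumptions: confirm_destructive and its parameters are hypothetical names, and how the actual role is discovered is left to the caller (on PostgreSQL, one option is querying pg_is_in_recovery(), which returns true on a standby).

```python
import socket


def confirm_destructive(action, expected_role, actual_role, hostname=None, ask=input):
    """Return True only if the host has the expected role AND the operator
    types the hostname exactly. Hypothetical guard for destructive scripts."""
    hostname = hostname or socket.gethostname()
    # Hard stop: never even prompt if the host is not the role we expect.
    if actual_role != expected_role:
        print(f"REFUSING: {hostname} is a {actual_role}, expected a {expected_role}.")
        return False
    typed = ask(
        f"You are about to {action} on {hostname} ({actual_role}). "
        "Type the hostname to confirm: "
    )
    return typed.strip() == hostname
```

Typing the hostname forces the operator to read which machine they are on, which is the check that fails at 11pm. The role comparison is a separate, non-interactive barrier, so a wipe intended for a replica is refused on a primary regardless of what is typed.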

Concepts in play

Backup + restore — test the restore path, not just the backup pipeline.
Replication — why replication is not backup.
Disaster recovery — the RPO that counts is the one a real restore achieves (here, 6 hours), not the one promised on paper.
Blast radius — destructive ops need guardrails.
Incident response — live-streaming the recovery was a masterclass in transparency.
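The "time since last successful restore" dashboard mentioned in the lessons reduces to a small calculation per backup system. The sketch below assumes a hypothetical mapping from system name to the timestamp of its last verified restore (None if a restore has never succeeded); the 24-hour target and the names are illustrative, not taken from the incident.

```python
from datetime import datetime, timedelta, timezone


def restore_staleness(last_verified_restore, now, max_age=timedelta(hours=24)):
    """Map each backup system to (age of last verified restore, within_target).

    Hypothetical input: {system_name: datetime or None}.
    """
    report = {}
    for system, verified_at in last_verified_restore.items():
        if verified_at is None:
            # Never restored successfully: treat as failing, not as unknown.
            report[system] = (None, False)
        else:
            age = now - verified_at
            report[system] = (age, age <= max_age)
    return report


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {"pg_dump": None, "disk_snapshot": now - timedelta(hours=6)}
    for system, (age, ok) in restore_staleness(example, now).items():
        print(system, "never restored" if age is None else age, "OK" if ok else "ALERT")
```

Treating "never restored" as a failure is the design choice that matters here: a backup system with no verified restore on record is shown as broken until proven otherwise.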