Concept · Databases

Write-Ahead Log (WAL)

01

Why this matters

You commit a transaction. Two milliseconds later, the power fails. On reboot, is the commit there? If yes, your DB is durable. If no, it's a lie. The trick every durable datastore uses — Postgres, MySQL, Cassandra, SQLite, Kafka — is the write-ahead log: before any change touches the main data files, write the intent to an append-only log on disk and fsync(). Crash? Replay the log, recover the state. This single technique powers the D in ACID.

02

How it works

Every modification follows this sequence:

  1. Write intent to WAL — an append-only sequential file. Fast, because disks are great at sequential writes.
  2. fsync() — ensure the log entry is actually on disk, not just in OS cache. This is the slow part; the source of commit-latency floor (~1–5ms on NVMe, ~10ms on HDD).
  3. Acknowledge — return success to the client. The operation is durable.
  4. Apply to data files — eventually, update the main B-tree / SSTables. Can happen lazily in the background.

On crash recovery: read the WAL from the last known-good point. Replay every committed entry. Main data files catch up. System is consistent again.

03

Why sequential writes are a superpower

Random disk writes (updating a B-tree leaf in place) are slow because the disk arm (HDD) or SSD controller has to seek. Sequential writes are fast — just append to the end.

An HDD does ~100 random IOPS but ~100 MB/s sequential. A modern SSD does ~100k random IOPS but ~3 GB/s sequential. Every generation of storage has this gap.

WAL exploits this: writes are always sequential. The expensive random updates to the main data structures happen in big batches, asynchronously, when the system is idle. Commit latency = one fsync + sequential append. Orders of magnitude faster than "update the B-tree in place and fsync."

Write-ahead log append + flush
import json, os, time

class WAL:
    def __init__(self, path, sync_every_ms=5):
        self.f = open(path, "ab")
        self.sync_every_ms = sync_every_ms
        self.last_sync = time.monotonic() * 1000

    def append(self, record):
        line = json.dumps(record) + "\n"
        self.f.write(line.encode())
        now = time.monotonic() * 1000
        if now - self.last_sync >= self.sync_every_ms:
            self.f.flush()
            os.fsync(self.f.fileno())
            self.last_sync = now

# Group commit: batch records between fsyncs. Trade durability window
# (bounded by sync_every_ms) for dramatically higher write throughput.
04

Where you see WAL

SystemWAL nameRole
PostgresWALDurability + replication streaming (replicas tail the WAL)
MySQL / InnoDBredo logDurability + crash recovery
SQLiteWAL modeEnable with PRAGMA journal_mode=WAL. Better concurrency than rollback journal.
Cassandra / RocksDBCommit logCrash recovery for the memtable before it flushes to SSTable
KafkaThe whole topic IS the logMessages are the log. No separate WAL; the log is the source of truth.
etcd / RaftLogEvery committed operation appended; replicated to quorum; survives any single-node crash.
05

Deep dive — group commit

Each fsync costs 1–5ms. If you fsync per transaction, max throughput = 200–1000 commits/sec. That's the durable-write ceiling for a single thread.

Group commit trick: batch multiple transactions into one fsync. N concurrent committers each write their WAL entries to the same buffer. One thread fsyncs. All N get their ack at once. Throughput goes from ~500 TPS to 50k+ TPS. Per-commit latency only rises slightly (you wait for the batch to flush).

This is why high-concurrency durable systems handle 10k+ commits/sec: the fsync cost amortizes across many transactions. Postgres (with synchronous_commit=on + concurrent clients), MySQL group commit, and Kafka's batching are all variants.

Interview one-liner

"Durability via write-ahead log: every commit appends to a sequential log and fsyncs before ACK. Reads serve from memory / B-trees; the log only matters on crash recovery. Group commit amortizes fsync cost across concurrent transactions."

06

Real-world

Postgres streaming replication

Replicas tail the WAL

Ship WAL records to replicas; replicas apply them. One log, two purposes: durability + replication.

Debezium / change-data-capture

Tail the WAL for events

CDC tools read the WAL directly to emit change events to Kafka. Every DB write becomes an event without app-side instrumentation.

Kafka

Log-as-database

Kafka inverts the model — the log is the database. "Turn your database inside out." Source of event-sourcing / CQRS architectures.

Raft / etcd / Consul

Replicated log

Every consensus-backed store is a replicated WAL. Writes appended; majority confirms; state machine applies.

07

Used in problems

Payment gateway uses WAL for the ledger (every transaction atomic). E-commerce order placement uses WAL via Postgres. Google Drive uses append-only logs for file-operation history. Distributed logging literally IS a giant WAL.

Next up