Memory-Mapped Files (mmap)

01

Why this matters

Reading a 100 GB file the normal way means calling read() in chunks: the kernel brings each block from disk into its page cache, then copies it into your user buffer before you process it. That is two copies per byte plus a syscall per chunk, which is slow at scale. mmap maps the file's bytes directly into your process's virtual memory. Accessing them is just a pointer dereference, and the OS pages in data on demand. There is no extra copy into a user buffer, the OS page cache becomes your cache, and cold reads are limited mainly by disk bandwidth.

It's part of the reason Kafka serves 500k msgs/sec per broker (alongside sendfile), why SQLite's optional mmap read path is fast, and how LMDB and bbolt achieve microsecond reads on multi-GB datasets.

02

How mmap works

You ask the kernel: "map this file into my address space." The kernel reserves a virtual address range but loads nothing yet; the page-table entries for that range start out empty. When you first access an address, a page fault triggers the kernel to read the relevant page (typically 4 KB) from disk into RAM, then fix up the page table. Subsequent accesses to the same page are pure memory reads, which is fast.
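
The flow above can be sketched with Python's standard mmap module. This is a minimal illustration, not a benchmark: make_demo_file and read_range_mmap are hypothetical helper names, and the 1 MiB demo file is an assumption chosen so the example runs quickly.

```python
import mmap
import os
import tempfile


def make_demo_file(size=1 << 20):
    """Create a demo file whose byte at position i is i % 256 (assumed test data)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size)))
    return path


def read_range_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a mapping."""
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages covering this range are touched; the kernel
            # faults them in on first access.
            return mm[offset:offset + length]


if __name__ == "__main__":
    path = make_demo_file()
    try:
        print(read_range_mmap(path, 4096, 4))
    finally:
        os.remove(path)
```

Note that slicing a Python mmap object copies the selected bytes into a new bytes object, so the zero-copy property applies to the mapping itself rather than to the returned slice.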

Three wins:

  • Zero copies. No buffer-to-buffer shuffling. Application code reads file data directly from the page cache.
  • Lazy loading. Map a 100 GB file, read only the bytes you need; the rest never touches RAM.
  • Shared cache. Two processes mapping the same file share the same physical pages. Memory-efficient.
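
The shared-cache point can be observed in a small experiment. The sketch below maps one file twice inside a single process as a stand-in for two processes (an assumption made to keep the example self-contained); shared_mapping_demo is a hypothetical name.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Map one file twice and show that both mappings see the same bytes."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello, page cache")
        # On Unix, mmap.mmap creates a shared mapping (MAP_SHARED) by default,
        # so stores land in the page-cache pages that back the file.
        writer = mmap.mmap(fd, 0)
        reader = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        try:
            writer[0:5] = b"HELLO"
            # The second mapping observes the change without any read() call.
            return bytes(reader[:])
        finally:
            reader.close()
            writer.close()
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(shared_mapping_demo())
```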

03

Sendfile — the writes-to-network sibling

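Kafka's other zero-copy trick. Normally, sending a file over a socket goes: disk → kernel buffer → user buffer → socket buffer → NIC. Four copies, two of them done by the CPU, plus a read() and a write() syscall per chunk. sendfile(socket, file, offset, length) tells the kernel: "send these file bytes directly to that socket." The kernel moves the data from the page cache to the socket itself, bypassing user space entirely; when the NIC supports scatter-gather DMA, the CPU never copies the payload at all.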

Kafka brokers serve fetch requests this way: file segment bytes go straight from the page cache to the network without ever entering Kafka's process memory. The CPU does not copy the payload, so throughput is limited mainly by NIC bandwidth.
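
A rough sketch of the same idea in Python, assuming a Unix-like system. socket.sendfile uses os.sendfile where the platform supports it and falls back to plain send() otherwise. The connected socket pair stands in for a broker-to-consumer connection, and send_file_over_socket and demo are hypothetical names, not Kafka code.

```python
import os
import socket
import tempfile


def send_file_over_socket(path, sock):
    """Send the whole file at `path` over `sock`; return the number of bytes sent."""
    with open(path, "rb") as f:
        # The file must be opened in binary mode and the socket must be a
        # blocking SOCK_STREAM socket.
        return sock.sendfile(f)


def demo(payload=b"segment-bytes" * 100):
    """Write `payload` to a temp file, send it, and return (sent, received)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    left, right = socket.socketpair()
    try:
        sent = send_file_over_socket(path, left)
        left.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        received = b""
        while True:
            chunk = right.recv(4096)
            if not chunk:
                break
            received += chunk
        return sent, received
    finally:
        left.close()
        right.close()
        os.remove(path)


if __name__ == "__main__":
    sent, received = demo()
    print(sent, len(received))
```

The payload is kept small so that it fits in the socket buffer and the single-threaded demo cannot block; a real server would send and receive on separate connections.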

04

When mmap is wrong

mmap is brilliant for read-heavy access to large files. Less great for:

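  • Heavy writes. Every store dirties a whole page, and a write to a page that is not resident first faults it in from disk. The kernel writes dirty pages back on its own schedule (or on msync), so random writes thrash the page cache. Use direct I/O or buffered write() for write-dominated paths.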
  • Tiny files. Setup overhead (page table entries, syscalls) outweighs benefits.
  • Network filesystems. mmap over NFS is a minefield of consistency surprises.
  • 32-bit address spaces. A process has only 2–3 GB of usable address space, so larger files cannot be mapped in one piece. Mostly historical now.
  • Predictable latency requirements. Page faults are unpredictable — a "fast" memory access can take 10 ms if it triggers disk I/O. Latency-critical paths use explicit I/O.
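
To make the heavy-writes case concrete, here is a minimal sketch of writing through a mapping, assuming a Unix-like system. A store only marks the page dirty in the page cache; flush() (msync on Unix) is what asks the kernel to write it back. write_through_mapping and demo are hypothetical names.

```python
import mmap
import os
import tempfile


def write_through_mapping(path, offset, data):
    """Overwrite bytes in place via a shared mapping, then request write-back."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            # A plain memory store: the page is now dirty in the page cache.
            mm[offset:offset + len(data)] = data
            # Ask the kernel to write the dirty pages back to the file.
            mm.flush()


def demo():
    """Write into the second 4 KB page of a temp file and read it back normally."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * 8192)
        write_through_mapping(path, 4096, b"dirty")
        with open(path, "rb") as f:
            f.seek(4096)
            return f.read(5)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

A mapping cannot grow the file, so an append-heavy workload would also need to extend the file and remap, which is one more reason write() is usually simpler for that path.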

05

Deep dive — Kafka's sequential mmap pattern

Kafka stores each topic-partition as a series of segment files (~1 GB each). Brokers mmap the index files but use plain sequential read+sendfile for the data segments. Why?

  • Indexes are small + random-access. Look up "offset 12345 → file position 4567." Index is ~10 MB, fits in RAM, mmap is perfect.
  • Data segments are huge + sequential. Read from offset 4567 to end. Sequential read is already kernel-optimized; mmap adds no benefit and costs page-table entries.
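
The index lookup described above can be sketched as a binary search over a memory-mapped file of fixed-size entries. The entry layout here (two big-endian 32-bit integers: offset, then file position) is a simplifying assumption for illustration, and build_index, open_index, and lookup are hypothetical names rather than Kafka's actual code.

```python
import mmap
import os
import struct
import tempfile

# Assumed entry layout: (message offset, file position), 8 bytes per entry.
ENTRY = struct.Struct(">II")


def build_index(path, entries):
    """Write (offset, position) pairs, sorted by offset, as fixed-size records."""
    with open(path, "wb") as f:
        for offset, position in entries:
            f.write(ENTRY.pack(offset, position))


def open_index(path):
    """Map the index read-only; the mapping stays valid after the file closes."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(mm, target):
    """Return the position of the last entry whose offset is <= target, or None."""
    lo, hi = 0, len(mm) // ENTRY.size - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        # Each probe is a memory read; only the touched pages are faulted in.
        offset, position = ENTRY.unpack_from(mm, mid * ENTRY.size)
        if offset <= target:
            best = position
            lo = mid + 1
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    build_index(path, [(0, 0), (12000, 4100), (12345, 4567), (13000, 5200)])
    mm = open_index(path)
    try:
        print(lookup(mm, 12345))
    finally:
        mm.close()
        os.remove(path)
```

The search returns the nearest entry at or before the target, after which a reader would scan forward in the data segment from that file position.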

Combined with sendfile to the network: the broker's CPU is barely involved. A modern Kafka broker bottlenecks on NIC bandwidth, not on CPU or memory bandwidth — exactly what you want for a log-streaming server. This is what the "Kafka uses the OS" mantra means: every storage trick the kernel offers, Kafka exploits.

Interview one-liner

"For sequential append-and-read workloads, mmap + sendfile let you serve at line-rate without copying bytes through user space. Kafka's throughput comes from this — not from clever data structures, but from getting out of the kernel's way."

06

Real-world

Kafka

mmap + sendfile

Serves fetch requests at multi-GB/s per broker without copying bytes to user space.

LMDB / bbolt

Entire DB is mmap-ed

Read path is pointer arithmetic. etcd's storage layer is bbolt; lookups on cached pages take microseconds.

SQLite

Optional mmap mode

Set PRAGMA mmap_size=... to let SQLite map up to that many bytes of the DB file. Read-heavy workloads can see a 2–3× speedup.

Redis RDB / fork-copy

Copy-on-write snapshots

Redis's BGSAVE forks; child sees the entire memory map; only modified pages copy. Snapshot a 50 GB Redis with seconds of fork overhead.

07

Used in problems

Distributed logging uses mmap + sendfile under the hood (Kafka). YouTube/Netflix-style CDN edges use sendfile to serve video chunks at line-rate. A Google Drive-style chunk server can use mmap to read hot chunks.
