Concept · Databases

Normalization vs Denormalization

01

Why this matters

Your data model has two extremes. Fully normalized: every fact lives in exactly one place, connected by foreign keys. Great for writes (update once, seen everywhere), bad for reads (join 5 tables to render one page). Fully denormalized: each row has everything needed to render a view. Great for reads (single fetch), bad for writes (update in 50 places when a user changes their name).

Every real system lives somewhere in between. Getting the balance right is the difference between a snappy app with cheap writes and a slow app with terrifying update cascades.

02

The two extremes

Normalized (3NF)

Single source of truth per fact

Tables: users, posts, comments, users_blocked_users. A post row stores author_id — to render it you join users. Change a username once, every post display updates. Writes: cheap. Reads: multi-join.

Denormalized

Each row is a self-contained view

Post row stores author_id, author_name, author_avatar_url. Render with one read. Change a username: rewrite every post that user wrote. Reads: O(1). Writes: O(posts_by_user).

03

When to denormalize

  1. Read-heavy workload. Reads outnumber writes 10:1 or more → denormalize the hot read path.
  2. Fixed access patterns. You always render "feed post with author." Then store "feed post with author" as a single row in the read store. Don't reinvent the join on every request.
  3. Rarely-changing copied data. Username changes once a year; author_name copied to 10k post rows is fine. Don't copy data that changes every minute.
  4. Data you can't afford to join across. Across shards, joins are application logic — either denormalize or fetch twice.
  5. Wide-column stores demand it. Cassandra, DynamoDB explicitly recommend denormalization — they can't join, period.
Decision Flowchart — Normalize or Denormalize? Mermaid
flowchart TD A[New table / new query path] A --> B{Read:write ratio > 10:1?} B -->|No| N1[Stay normalized · 3NF] B -->|Yes| C{Source data changes < 1×/day?} C -->|No| N2[Stay normalized · denorm too volatile] C -->|Yes| D{Cross-shard join needed?} D -->|Yes| E[Denormalize · copy fields inline] D -->|No| F{Latency budget < 10ms?} F -->|Yes| E F -->|No| N3[Stay normalized · join is fast enough] style E fill:#e4f5ec,stroke:#1a5c38 style N1 fill:#f0eeeb,stroke:#878480 style N2 fill:#f0eeeb,stroke:#878480 style N3 fill:#f0eeeb,stroke:#878480
10:1
read:write threshold for denorm
< 1/day
source-change rate that's safe to copy
10 ms
latency budget where join hurts
typical write amplification per copy
04

The four patterns

PatternHowExample
Copy columnsStore the foreign object's frequently-read fields inlinecomment table stores author_name, author_avatar_url
Pre-computed aggregateMaintain a counter/sum/count alongside the source datapost.like_count column updated on every like/unlike
Materialized viewSecond table derived from the first, refreshed asynctrending_posts table recomputed every 5 min
Read-side projection (CQRS)Separate store shaped for reads, written via events from source-of-truth storeWrite to Postgres; projection writes to Elasticsearch
05

What denormalization costs

  • Update cascades. User changes their name → rewrite N rows. If N is big (celebrity with 10M posts), this is a background job, not a synchronous write.
  • Potential inconsistency. Update the source, then update 10 copies. If one update fails, the system is briefly inconsistent. Idempotent retries + eventual consistency.
  • Storage. Copying a 200-byte user record to 10M posts = 2GB of duplicate bytes. Usually fine; occasionally significant.
  • Schema evolution. Renaming a column means rewriting the copies. Slower to change.
The working rule

Start normalized (3NF). When a specific query becomes slow, denormalize that query's data, not the whole schema. Treat denormalization as an optimization, not a starting stance.

06

Deep dive — counters, the classic denormalization

You want to show "posts by this user: 42" on their profile. Three options:

Compute from source: SELECT COUNT(*) FROM posts WHERE user_id = 5. Always right. Costs an index scan per profile view. OK at small scale; dies at scale.

Denormalized counter: users.post_count column, incremented on every post insert. One extra write per post, zero on read. Fast but drift-prone: failed increment → counter is off forever. Crashed replica → counter is off forever. Easy to corrupt.

Periodic reconciliation: keep the counter, but also run a nightly job UPDATE users SET post_count = (SELECT COUNT(*)...) . Counter drifts during the day; corrects at night. Best-of-both.

For financial data: never trust a denormalized counter. Reconcile from source every transaction. For social media likes: denormalized counter + occasional reconciliation is industry standard.

07

Real-world

Twitter home feed

Fan-out on write = denormalization

Each tweet copied into every follower's feed cache. Write amplification of 100× for normal users; 10M× for celebrities. Pure denormalization for read speed.

Amazon DynamoDB model

Single-table design

Pre-compute the joins. One table holds orders + line items + customer fields, accessed by a single composite key.

Elasticsearch

Pure denormalization

Each document fully self-contained. No joins. Excellent reads, painful writes when source objects change.

Stripe

Normalized with materialized views

Charges and refunds normalized in Postgres. Analytics rollups materialized in a separate store, updated by events.

08

Used in problems

News feed denormalizes by fan-out on write. Reddit denormalizes comment counts onto posts. E-commerce denormalizes product name + price onto cart line items (price at time of purchase).

Next up