Concept · Databases

Data Lake vs Warehouse

01

Why this matters

"Where do analytics queries run?" If you say "Postgres," you've never had real analytics needs. Running aggregations over 10 TB of order history on your transactional DB tanks every API. Analytics belongs in a separate stack entirely — historically a data warehouse, more recently a data lake, and now a hybrid lakehouse. Picking wrong burns months and millions.

02

The three architectures

PatternStorageSchemaQueryCost
Data warehouseColumnar, proprietaryOn write (ETL transforms before load)SQL, very fast$$$ — storage + compute coupled
Data lakeObject store (S3, GCS), open formats (Parquet)On read (raw JSON / CSV / events)Spark, Athena, Trino — slower$ — storage cheap, compute on demand
LakehouseObject store + transactional metadata layer (Delta, Iceberg, Hudi)On write OR read; ACID transactionsBoth SQL and Spark$$ — best of both, more setup
03

Why warehouses exist

OLTP databases (Postgres, MySQL) store rows. Reading "SUM(revenue) by product over 2 years" reads every byte of every row to extract one column. Awful.

Warehouses store columnar: all values of column X stored together. Reading the revenue column scans only those bytes — 10-100× less I/O. Plus columnar compresses well (similar values cluster) and SIMD-vectorizes (one CPU instruction operates on many values).

Combine columnar storage with massively parallel processing (MPP), and you get sub-second queries over terabytes. Snowflake, BigQuery, Redshift, ClickHouse all do this.

04

Why data lakes appeared

Warehouses force schema-on-write — you ETL data into a star schema before it lands. Painful when:

  • You don't know what queries you'll run later (data scientists exploring).
  • Data is semi-structured (JSON event blobs, ML features, images).
  • Storage cost matters at petabyte scale (warehouses charge per TB-month).

Data lakes flip it: dump raw data into S3 in open formats (Parquet, JSON, Avro). Schema lives in a metastore (Hive Metastore, AWS Glue Catalog). Query engines (Spark, Athena, Trino, Presto) read the files at query time. Storage cheap; compute pay-per-query.

Trade-offs: queries slower than warehouse (data sits in S3, not loaded into a fast engine). No transactions until lakehouse formats arrived. Schema drift can corrupt downstream.

05

Deep dive — Iceberg, Delta, Hudi (the lakehouse wave)

Around 2018, three table formats appeared to add transactional semantics on top of S3-based lakes:

  • Apache Iceberg (Netflix-originated) — most widely supported, open standard.
  • Delta Lake (Databricks) — best Spark integration; now open-sourced.
  • Apache Hudi (Uber) — incremental processing, optimized for upsert-heavy workloads.

All three: maintain a metadata layer (manifest files, transaction log) on top of Parquet data files in S3. Provide:

  • ACID transactions over object-store data — concurrent writes don't corrupt each other.
  • Time travel — query the table as of yesterday/last week. Snapshot isolation.
  • Schema evolution — add/drop columns without rewriting every file.
  • Compaction + Z-ordering — physical layout optimization for query speed.
  • UPSERT / DELETE — features warehouses had, lakes lacked.

Result: a "lakehouse" — lake's cheap storage + warehouse's transactional + analytical guarantees. Snowflake, Databricks, BigQuery all support reading Iceberg directly. Many teams now skip dedicated warehouses, query S3+Iceberg directly with Trino / Athena / Snowflake.

Modern stack

"Operational data in Postgres. CDC to Kafka. Streaming jobs land into Iceberg tables on S3. Analysts query via Trino. ML pipelines read Parquet directly via Spark. One source of truth, multiple consumers."

06

Real-world

Snowflake / BigQuery / Redshift

Cloud warehouses

Decoupled storage + compute (mostly). SQL-only interface. Pay per query or per second of compute.

Databricks

Lakehouse pioneer

Spark + Delta Lake. Strong for ML + analytics combined. Increasingly competing with Snowflake.

Trino / Presto + S3

Open-source query engine

Federated SQL across S3 + Postgres + Mongo. What Netflix and Airbnb run for ad-hoc analytics.

ClickHouse

Real-time columnar

Self-hosted; sub-second on billions of rows. Used by Uber for real-time analytics, Cloudflare for log analytics.

07

Used in problems

Recommendation algorithm trains models on warehouse data. Leaderboard precomputes from warehouse aggregates. Count active users runs analytics on data lake (raw event blobs). Distributed logging tier-2 storage IS a data lake.

Next up