05
Deep dive — Iceberg, Delta, Hudi (the lakehouse wave)
In the late 2010s, three open table formats emerged to add transactional semantics to data lakes on S3 and similar object stores:
- Apache Iceberg (Netflix-originated) — most widely supported, open standard.
- Delta Lake (Databricks) — best Spark integration; now open-sourced.
- Apache Hudi (Uber) — incremental processing, optimized for upsert-heavy workloads.
All three maintain a metadata layer (manifest files or a transaction log) on top of data files, typically Parquet, in S3. On top of that layer they provide:
- ACID transactions over object-store data — concurrent writes don't corrupt each other.
- Time travel — query the table as of yesterday/last week. Snapshot isolation.
- Schema evolution — add/drop columns without rewriting every file.
- Compaction + Z-ordering — physical layout optimization for query speed.
- UPSERT / DELETE — features warehouses had, lakes lacked.
Result: a "lakehouse", which combines the lake's cheap storage with the warehouse's transactional and analytical guarantees. Snowflake, Databricks, and BigQuery can all read Iceberg tables directly. Many teams now skip loading data into a dedicated warehouse and instead query Iceberg tables on S3 in place, using engines such as Trino, Athena, or Snowflake.
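To make the metadata-layer idea concrete, here is a minimal in-memory sketch of how immutable data files plus an append-only snapshot log can give atomic commits and time travel. It is a toy model, not the on-disk layout or API of Iceberg, Delta, or Hudi: the names `ToyTable`, `Snapshot`, `commit_append`, and `scan` are invented for illustration, and a Python dict stands in for the object store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


class CommitConflict(Exception):
    """Raised when another writer committed first (optimistic concurrency)."""


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: Tuple[str, ...]  # files visible to readers of this snapshot


@dataclass
class ToyTable:
    # Stand-in for the object store: file name -> rows. Files are never edited in place.
    files: Dict[str, List[dict]] = field(default_factory=dict)
    # Stand-in for the metadata log: an append-only list, starting with an empty snapshot.
    snapshots: List[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base: Snapshot, file_name: str, rows: List[dict]) -> Snapshot:
        # Write the data file first; it stays invisible until a snapshot references it.
        self.files[file_name] = list(rows)
        # Commit only if nobody else committed since `base` was read.
        if base.snapshot_id != self.current().snapshot_id:
            raise CommitConflict(f"table moved past snapshot {base.snapshot_id}")
        new = Snapshot(base.snapshot_id + 1, base.data_files + (file_name,))
        self.snapshots.append(new)
        return new

    def scan(self, snapshot_id: Optional[int] = None) -> List[dict]:
        # Snapshot ids double as list indexes in this toy model.
        snap = self.current() if snapshot_id is None else self.snapshots[snapshot_id]
        return [row for name in snap.data_files for row in self.files[name]]


table = ToyTable()
base = table.current()
table.commit_append(base, "part-0001.parquet", [{"id": 1}])

# A second writer still holding the stale `base` is rejected and retries on the new head.
try:
    table.commit_append(base, "part-0002.parquet", [{"id": 2}])
except CommitConflict:
    table.commit_append(table.current(), "part-0002.parquet", [{"id": 2}])

print(table.scan())               # rows in the latest snapshot
print(table.scan(snapshot_id=1))  # time travel: rows as of snapshot 1
```

Readers only ever follow a snapshot's file list, so a half-written file is never visible, and older snapshots remain queryable for as long as their files are retained. The real formats differ in how the commit step is made atomic on an object store; that detail is deliberately left out here.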
Modern stack
"Operational data in Postgres. CDC to Kafka. Streaming jobs land into Iceberg tables on S3. Analysts query via Trino. ML pipelines read Parquet directly via Spark. One source of truth, multiple consumers."
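The "CDC to Kafka, land into Iceberg" step boils down to applying an ordered stream of change events to a keyed table as upserts and deletes. The sketch below shows that merge logic in plain Python. The event shape (`op`, `key`, `row`) and the function name `apply_cdc` are assumptions made for illustration, not the format of any particular CDC tool; in a real pipeline the same logic would normally be expressed as a merge or upsert operation in the streaming engine rather than written by hand.

```python
from typing import Dict, Iterable


def apply_cdc(table: Dict[int, dict], events: Iterable[dict]) -> Dict[int, dict]:
    """Return a new table state with change events applied in order."""
    merged = dict(table)  # leave the input state untouched
    for event in events:
        key = event["key"]
        if event["op"] in ("insert", "update"):
            merged[key] = event["row"]  # upsert: the latest event per key wins
        elif event["op"] == "delete":
            merged.pop(key, None)       # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {event['op']!r}")
    return merged


# Hypothetical change events for an `orders` table keyed by id.
orders = {1: {"id": 1, "status": "new"}}
events = [
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "delete", "key": 1},
]
print(apply_cdc(orders, events))
```

Because events are applied in order, replaying the same batch onto the same starting state yields the same result, which is what lets several consumers treat the table as a single source of truth.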