"Compute total revenue per day per product." Two answers exist:
Batch: at midnight, scan all of yesterday's orders, aggregate, write to dashboard. Latency: hours. Throughput: huge. Reproducible, cheap, easy.
Stream: every order event flows through a continuous pipeline that updates the running total in real time. Latency: seconds. Throughput: also huge. Complex, more failure modes, but live.
Same answer, very different architectures. Choose batch and you can't show "live revenue" to the dashboard. Choose stream and you pay 10× the operational complexity for analytics that don't need real-time. Most modern systems do both — Lambda / Kappa architectures formalize the split.
02
Spectrum of latency vs complexity
Approach
Latency
Re-run on bug
Tools
Batch (daily)
Hours-day
Trivial — re-run job on yesterday's data
Hadoop, Spark batch, dbt, BigQuery
Micro-batch (5 min)
Minutes
Re-run the affected windows
Spark Structured Streaming
Stream (event-time)
Seconds
Replay events from log; rebuild state
Flink, Kafka Streams, Beam
Synchronous (per-request)
Milliseconds
N/A — query computes on demand
OLTP DBs, materialized views
Latency vs Complexity SpectrumSVG
03
Why streaming is hard
Batch is easy because the input is fixed: yesterday's data sits in a file, you process it, write output, done. Streaming has a moving input.
Headaches that don't exist in batch:
Late events. An event from 10 minutes ago arrives now. Did your "last 5 minutes" aggregate already finalize? Watermarks + windowing fix this — Flink and Beam build their model around it.
Exactly-once. A pipeline restart should not double-count. Requires checkpointing of stream state + idempotent or transactional sinks (see delivery guarantees).
State. "Total revenue per product per minute" needs running sums. State must persist across restarts. Stream frameworks bake in keyed state stores backed by RocksDB.
Backfilling. A bug fix means re-processing the last 7 days. In batch this is one job. In streaming, you need to replay from Kafka offsets — possible if you kept retention long enough.
Reservoir sampling (uniform sample from stream)
import random
def reservoir_sample(stream, k):
"""Uniform sample of k items from unknown-length stream."""
sample = []
for i, item in enumerate(stream):
if i < k:
sample.append(item)
else:
# Each new item has k / (i+1) probability of replacing
j = random.randint(0, i)
if j < k:
sample[j] = item
return sample
# Property: after processing N items, each of the N had equal k/N probability
# of being in the final sample. O(1) memory, O(N) time.
04
Lambda vs Kappa architectures
Lambda architecture
Run both batch + stream layers
Batch layer = source of truth, recomputes from scratch nightly. Speed layer = streaming for live numbers. Serving layer merges both. Costs: two pipelines doing similar logic, kept in sync.
Kappa architecture
Stream is the only layer
Everything is a stream from Kafka. To "rebuild," replay the log from offset zero into a fresh consumer. One codebase, one pipeline. Wins when stream tools mature enough that you don't need batch's reprocessing simplicity.
Modern systems lean Kappa. Flink + Kafka can replay months of events in hours, recomputing whatever you want. Batch lives on for cost-efficiency on truly massive backfills (years of data) and for analyst-facing SQL where latency doesn't matter.
05
Deep dive — Flink's stateful streaming
Flink is the modern streaming gold standard. Its model:
Operators (map, filter, window, join) connected in a DAG.
Keyed state: each operator can hold per-key state in an embedded RocksDB. "Sum revenue per product_id" maintains a counter per product key, persisting locally.
Checkpoints: periodically (every few seconds), Flink snapshots all operator state + Kafka offsets in a coordinated barrier. On failure, rewinds to last checkpoint, replays Kafka from that offset, exactly-once semantics achieved.
Watermarks: Flink tracks "we've seen all events up to time T." Windows close based on watermarks, not wall clock — handles late events correctly.
Savepoints: manually-triggered snapshots. Use to upgrade Flink jobs without losing state.
This combination — per-key state + checkpointing + watermarks + replay — gives streaming the same correctness guarantees that batch always had. Fraud detection, real-time dashboards, anomaly alerts at Uber, Netflix, Pinterest all run on Flink.
06
Real-world
Apache Spark
Batch + micro-batch
The dominant batch engine. Structured Streaming module adds micro-batch streaming with similar API.
Apache Flink
True streaming
Event-time semantics, stateful processing, exactly-once. Used at Uber, Netflix, Alibaba for real-time analytics.
Kafka Streams
Library, not framework
JVM library that turns Kafka into a streaming platform. Lighter than Flink; fewer features.
dbt + Snowflake / BigQuery
Modern batch analytics
SQL-based batch transformations on cloud warehouses. Where most "analytics" actually lives — complement to streaming, not competitor.
07
Used in problems
Recommendation algorithm uses batch (nightly model retraining) + stream (live click events). Leaderboard streams scores in real time. Count active users uses streaming with HyperLogLog. Distributed logging is fundamentally a stream; analytics atop is batch.