Batch vs Stream Processing

01

Why this matters

"Compute total revenue per day per product." Two answers exist:

Batch: at midnight, scan all of yesterday's orders, aggregate, write to dashboard. Latency: hours. Throughput: huge. Reproducible, cheap, easy.
Stream: every order event flows through a continuous pipeline that updates the running total in real time. Latency: seconds. Throughput: also huge. Complex, more failure modes, but live.

Same answer, very different architectures. Choose batch and you can't show "live revenue" to the dashboard. Choose stream and you pay 10× the operational complexity for analytics that don't need real-time. Most modern systems do both — Lambda / Kappa architectures formalize the split.

02

Spectrum of latency vs complexity

Approach	Latency	Re-run on bug	Tools
Batch (daily)	Hours-day	Trivial — re-run job on yesterday's data	Hadoop, Spark batch, dbt, BigQuery
Micro-batch (5 min)	Minutes	Re-run the affected windows	Spark Structured Streaming
Stream (event-time)	Seconds	Replay events from log; rebuild state	Flink, Kafka Streams, Beam
Synchronous (per-request)	Milliseconds	N/A — query computes on demand	OLTP DBs, materialized views

Latency vs Complexity Spectrum SVG

03

Why streaming is hard

Batch is easy because the input is fixed: yesterday's data sits in a file, you process it, write output, done. Streaming has a moving input.

Headaches that don't exist in batch:

Late events. An event from 10 minutes ago arrives now. Did your "last 5 minutes" aggregate already finalize? Watermarks + windowing fix this — Flink and Beam build their model around it.
Exactly-once. A pipeline restart should not double-count. Requires checkpointing of stream state + idempotent or transactional sinks (see delivery guarantees).
State. "Total revenue per product per minute" needs running sums. State must persist across restarts. Stream frameworks bake in keyed state stores backed by RocksDB.
Backfilling. A bug fix means re-processing the last 7 days. In batch this is one job. In streaming, you need to replay from Kafka offsets — possible if you kept retention long enough.

Reservoir sampling (uniform sample from stream)

import random

def reservoir_sample(stream, k):
    """Uniform sample of k items from unknown-length stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Each new item has k / (i+1) probability of replacing
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Property: after processing N items, each of the N had equal k/N probability
# of being in the final sample. O(1) memory, O(N) time.

04

Lambda vs Kappa architectures

Lambda architecture

Run both batch + stream layers

Batch layer = source of truth, recomputes from scratch nightly. Speed layer = streaming for live numbers. Serving layer merges both. Costs: two pipelines doing similar logic, kept in sync.

Kappa architecture

Stream is the only layer

Everything is a stream from Kafka. To "rebuild," replay the log from offset zero into a fresh consumer. One codebase, one pipeline. Wins when stream tools mature enough that you don't need batch's reprocessing simplicity.

Modern systems lean Kappa. Flink + Kafka can replay months of events in hours, recomputing whatever you want. Batch lives on for cost-efficiency on truly massive backfills (years of data) and for analyst-facing SQL where latency doesn't matter.

05

Deep dive — Flink's stateful streaming

Flink is the modern streaming gold standard. Its model:

Operators (map, filter, window, join) connected in a DAG.
Keyed state: each operator can hold per-key state in an embedded RocksDB. "Sum revenue per product_id" maintains a counter per product key, persisting locally.
Checkpoints: periodically (every few seconds), Flink snapshots all operator state + Kafka offsets in a coordinated barrier. On failure, rewinds to last checkpoint, replays Kafka from that offset, exactly-once semantics achieved.
Watermarks: Flink tracks "we've seen all events up to time T." Windows close based on watermarks, not wall clock — handles late events correctly.
Savepoints: manually-triggered snapshots. Use to upgrade Flink jobs without losing state.

This combination — per-key state + checkpointing + watermarks + replay — gives streaming the same correctness guarantees that batch always had. Fraud detection, real-time dashboards, anomaly alerts at Uber, Netflix, Pinterest all run on Flink.

06

Real-world

Apache Spark

Batch + micro-batch

The dominant batch engine. Structured Streaming module adds micro-batch streaming with similar API.

Apache Flink

True streaming

Event-time semantics, stateful processing, exactly-once. Used at Uber, Netflix, Alibaba for real-time analytics.

Kafka Streams

Library, not framework

JVM library that turns Kafka into a streaming platform. Lighter than Flink; fewer features.

dbt + Snowflake / BigQuery

Modern batch analytics

SQL-based batch transformations on cloud warehouses. Where most "analytics" actually lives — complement to streaming, not competitor.

07

Used in problems

Recommendation algorithm uses batch (nightly model retraining) + stream (live click events). Leaderboard streams scores in real time. Count active users uses streaming with HyperLogLog. Distributed logging is fundamentally a stream; analytics atop is batch.

📺