Concept · Messaging & Async

Batch vs Stream Processing

01

Why this matters

"Compute total revenue per day per product." Two answers exist:

  • Batch: at midnight, scan all of yesterday's orders, aggregate, write to dashboard. Latency: hours. Throughput: huge. Reproducible, cheap, easy.
  • Stream: every order event flows through a continuous pipeline that updates the running total in real time. Latency: seconds. Throughput: also huge. Complex, more failure modes, but live.

Same answer, very different architectures. Choose batch and you can't show "live revenue" to the dashboard. Choose stream and you pay 10× the operational complexity for analytics that don't need real-time. Most modern systems do both — Lambda / Kappa architectures formalize the split.

02

Spectrum of latency vs complexity

ApproachLatencyRe-run on bugTools
Batch (daily)Hours-dayTrivial — re-run job on yesterday's dataHadoop, Spark batch, dbt, BigQuery
Micro-batch (5 min)MinutesRe-run the affected windowsSpark Structured Streaming
Stream (event-time)SecondsReplay events from log; rebuild stateFlink, Kafka Streams, Beam
Synchronous (per-request)MillisecondsN/A — query computes on demandOLTP DBs, materialized views
Latency vs Complexity Spectrum SVG
1 ms 1 s 1 min 1 hr 1 day end-to-end latency (log scale) → Sync per-request OLTP DB · materialized views Stream event-time Flink · Kafka Streams · Beam Micro-batch 5 min Spark Structured Streaming Batch nightly Spark · Hadoop · dbt · BigQuery low complexity ← → high complexity
03

Why streaming is hard

Batch is easy because the input is fixed: yesterday's data sits in a file, you process it, write output, done. Streaming has a moving input.

Headaches that don't exist in batch:

  • Late events. An event from 10 minutes ago arrives now. Did your "last 5 minutes" aggregate already finalize? Watermarks + windowing fix this — Flink and Beam build their model around it.
  • Exactly-once. A pipeline restart should not double-count. Requires checkpointing of stream state + idempotent or transactional sinks (see delivery guarantees).
  • State. "Total revenue per product per minute" needs running sums. State must persist across restarts. Stream frameworks bake in keyed state stores backed by RocksDB.
  • Backfilling. A bug fix means re-processing the last 7 days. In batch this is one job. In streaming, you need to replay from Kafka offsets — possible if you kept retention long enough.
Reservoir sampling (uniform sample from stream)
import random

def reservoir_sample(stream, k):
    """Uniform sample of k items from unknown-length stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Each new item has k / (i+1) probability of replacing
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Property: after processing N items, each of the N had equal k/N probability
# of being in the final sample. O(1) memory, O(N) time.
04

Lambda vs Kappa architectures

Lambda architecture

Run both batch + stream layers

Batch layer = source of truth, recomputes from scratch nightly. Speed layer = streaming for live numbers. Serving layer merges both. Costs: two pipelines doing similar logic, kept in sync.

Kappa architecture

Stream is the only layer

Everything is a stream from Kafka. To "rebuild," replay the log from offset zero into a fresh consumer. One codebase, one pipeline. Wins when stream tools mature enough that you don't need batch's reprocessing simplicity.

Modern systems lean Kappa. Flink + Kafka can replay months of events in hours, recomputing whatever you want. Batch lives on for cost-efficiency on truly massive backfills (years of data) and for analyst-facing SQL where latency doesn't matter.

05

Deep dive — Flink's stateful streaming

Flink is the modern streaming gold standard. Its model:

  • Operators (map, filter, window, join) connected in a DAG.
  • Keyed state: each operator can hold per-key state in an embedded RocksDB. "Sum revenue per product_id" maintains a counter per product key, persisting locally.
  • Checkpoints: periodically (every few seconds), Flink snapshots all operator state + Kafka offsets in a coordinated barrier. On failure, rewinds to last checkpoint, replays Kafka from that offset, exactly-once semantics achieved.
  • Watermarks: Flink tracks "we've seen all events up to time T." Windows close based on watermarks, not wall clock — handles late events correctly.
  • Savepoints: manually-triggered snapshots. Use to upgrade Flink jobs without losing state.

This combination — per-key state + checkpointing + watermarks + replay — gives streaming the same correctness guarantees that batch always had. Fraud detection, real-time dashboards, anomaly alerts at Uber, Netflix, Pinterest all run on Flink.

06

Real-world

Apache Spark

Batch + micro-batch

The dominant batch engine. Structured Streaming module adds micro-batch streaming with similar API.

Apache Flink

True streaming

Event-time semantics, stateful processing, exactly-once. Used at Uber, Netflix, Alibaba for real-time analytics.

Kafka Streams

Library, not framework

JVM library that turns Kafka into a streaming platform. Lighter than Flink; fewer features.

dbt + Snowflake / BigQuery

Modern batch analytics

SQL-based batch transformations on cloud warehouses. Where most "analytics" actually lives — complement to streaming, not competitor.

07

Used in problems

Recommendation algorithm uses batch (nightly model retraining) + stream (live click events). Leaderboard streams scores in real time. Count active users uses streaming with HyperLogLog. Distributed logging is fundamentally a stream; analytics atop is batch.

Next up