System Design — 016

Distributed Job Scheduler

How do you reliably execute millions of scheduled tasks at precisely the right time across a fleet of unreliable machines, ensuring no job is missed and no job runs twice?

distributed-systemsexactly-onceleader-electionkafkatime-based
01

Problem Statement

Think of cron on Linux — you write a crontab entry, the daemon wakes up every minute, checks if anything is due, and runs it. Works perfectly on one machine. Now imagine you need to send 50M promotional notifications at exactly 9:00 AM, expire 200K flash-deal prices at noon, and generate settlement reports for 100K sellers every night. One cron box can't handle this. If it dies, everything stops. If you add a second box, the same job runs twice.

That's the problem: cron, but distributed, reliable, and scalable. We're building the engine behind systems like Google Cloud Scheduler, AWS EventBridge Scheduler, or internal job scheduling infrastructure at companies like Uber and Amazon.

Core question: How do you guarantee every scheduled job fires exactly once at the right time, across machines that can crash at any moment?

The Core Tensions

Precision vs Scale

You want jobs at exactly T+0. But 4.2M jobs at midnight can't all fire at T+0. How much jitter is acceptable?

Reliability vs Simplicity

Multiple scheduler nodes means coordination to avoid duplicate execution. Coordination means complexity and new failure modes.

Exactly-Once vs Performance

Distributed locking on every job pickup is expensive. At-least-once with idempotent jobs pushes complexity to job authors.

Flexibility vs Predictability

Supporting priorities, retries, cron expressions, one-time and recurring jobs means a complex state machine with more edge cases.

Single-Machine Baseline

The simplest version: a jobs table in PostgreSQL, a ticker process that runs every second querying WHERE next_run_at <= NOW(), and inline execution. This breaks in four ways at scale: ticker dies (no jobs run), slow jobs block fast ones (backlog), millions due at midnight (query chokes), and two tickers (duplicate execution). Every distributed component exists to solve one of these four breakdowns.

02

Requirements

Functional Requirements

  • Create jobs — one-time ("run at time T") and recurring ("cron expression + timezone"). The scheduler is a trigger mechanism, not a compute platform — it fires webhooks or queue messages, not arbitrary code.
  • CRUD + Pause/Resume — update schedule, change priority, pause recurring jobs (resume from "now", not catchup), soft-delete with audit trail.
  • Execution history — last N runs per job with status, duration, error, retry count. Essential for debugging and trust.
  • Retry on failure — configurable max retries, backoff strategy (fixed, exponential, exponential with jitter). System detects worker crashes via lease expiry.
  • Priority support — CRITICAL > HIGH > MEDIUM > LOW. Payment reconciliation must not wait behind analytics reports.
  • Completion notifications — webhook callback on success, failure, or DLQ entry.

Non-Functional Requirements

  • Effectively exactly-once execution — at-least-once triggering + three layers of deduplication (lease, fencing, idempotency). The rabbit hole for this problem.
  • Low trigger jitter < 5 seconds — a job scheduled for 09:00:00 should fire by 09:00:05. This constrains polling frequency and pushes toward pre-fetching.
  • High availability — survives single node crash, DB failover, network partition. Missed jobs during outage execute on recovery (configurable catchup window).
  • Scale to 10M+ registered jobs, 50K triggers/sec at peak — rules out single-node designs. Job store, tick mechanism, and worker fleet all scale horizontally.
  • Durability — job definitions and execution history survive crashes. In-memory-only schedulers are unacceptable.
  • Multi-tenancy and fairness — tenant A's 5M midnight jobs must not starve tenant B's 10 critical jobs.

The impossible triangle: Exactly-once + High availability + Low jitter. Every design decision is picking a side of this triangle. We prioritize availability (never miss) and low jitter, then layer defenses for exactly-once.

Explicit Non-Goals

  • Not a workflow/DAG engine — no job dependencies (Job B after Job A). That's Airflow/Temporal territory.
  • Not a compute platform — we trigger jobs, we don't run arbitrary code. Workers are external.
  • Not sub-second precision — second-level granularity is our floor.
03

Scale Estimation

We're designing for a platform-scale scheduler — 10M registered jobs as the primary target, with a path to 100M.

~50K/s
Peak triggers per second
4.2M
Midnight spike (one second)
21 GB
Job metadata (fits single DB)
5.4 TB
Execution history per month
1.5M
In-flight jobs at peak
100 MB/s
Dispatch bandwidth

Derivation

10M jobs: 70% recurring (mix of 1-min to weekly), 30% one-time. Recurring triggers sum to ~900M/day ≈ 10K/sec average. Peak-to-average ratio of 5× gives 50K triggers/sec at peak. The thundering herd at midnight: 30% hourly + 30% daily jobs all due at 00:00:00 = 4.2M jobs in one second. This single number forces pre-fetching — you cannot query 4.2M rows from Postgres in real-time. Job metadata at 2KB each = 21 GB (fits Postgres). Execution history at 200 bytes × 900M/day × 30 days = 5.4 TB/month — needs a separate store.

Key insight: Storage is not the bottleneck — the query pattern is. "Give me all jobs where next_run_at <= NOW()" hits the same index every second, and at top-of-hour the result set explodes. This pushes us from "poll the DB" to "pre-load into a queue."

04

API Design

Two entities: Job (the definition — long-lived) and Execution (a single run — short-lived). Recurring jobs stay SCHEDULED and spawn execution instances. All write operations require an Idempotency-Key header.

Job Lifecycle State Machine

SCHEDULED → QUEUED → RUNNING → SUCCEEDED
                        ↓
                     FAILED → RETRYING → QUEUED
                        ↓
                  DEAD (max retries exhausted)

PAUSED  (out of scheduling loop, resume → SCHEDULED)
CANCELLED (terminal, no further executions)
Create a Recurring Job
POST /api/v1/jobs
Headers: Idempotency-Key: "client-uuid-123"

{
  "name": "nightly-seller-report",
  "type": "RECURRING",
  "schedule": {
    "cron": "0 2 * * *",
    "timezone": "Asia/Kolkata",
    "starts_at": "2026-04-07T00:00:00Z"
  },
  "action": {
    "type": "WEBHOOK",
    "url": "https://seller-service.internal/reports/generate",
    "method": "POST",
    "headers": { "Authorization": "Bearer {{secret_ref}}" },
    "body": { "report_type": "settlement" }
  },
  "retry_policy": {
    "max_retries": 3,
    "backoff": "EXPONENTIAL_WITH_JITTER",
    "initial_delay_seconds": 30
  },
  "priority": "HIGH",
  "tenant_id": "seller-platform"
}

→ 201 Created
{ "job_id": "job_8xk2m9f3", "status": "SCHEDULED",
  "next_run_at": "2026-04-08T02:00:00+05:30" }

Timezone Handling

Store cron + timezone, compute next_run_at in UTC at evaluation time — not at creation. Pre-computing UTC times for future dates gets DST transitions wrong.

Version-Based Invalidation

Every job has a version field, bumped on update. Queued executions carry the version at queue time. Workers check: if stale, discard.

Worker Internal API — Pull-Based
POST /internal/v1/executions/claim
{ "worker_id": "worker-us-east-42", "capacity": 10 }

POST /internal/v1/executions/heartbeat
{ "worker_id": "worker-us-east-42",
  "execution_ids": ["exec_1", "exec_2"] }

POST /internal/v1/executions/{id}/complete
{ "status": "SUCCEEDED", "duration_ms": 34200 }
05

High-Level Architecture

Every component exists because a specific number or failure mode forced it. The single-machine baseline breaks in four ways — each component solves one.

API Service Stateless Job Store PostgreSQL Scheduler Leader-Elected ZooKeeper Leader Election Dispatch Queue Kafka (priority topics) Worker Fleet Stateless Agents Lease Store Redis Execution Ledger Cassandra Dead Letter Queue Kafka Topic Notifications Async Service Outbox Relay Atomicity Guarantee CRUD Pre-fetch Election Dispatch Pull Lease Result Failed Alert Outbox Publish
ComponentTechWhyScaling
API ServiceStateless appUser-facing CRUD, decoupled from scheduling pathHorizontal, behind LB
Job StorePostgreSQL21 GB fits one DB, strong consistency for state transitionsPrimary-replica
SchedulerLeader-electedPre-fetches due jobs every 30s, prevents duplicate pre-fetchSingle leader + standbys
Dispatch QueueKafka50K/sec, durable, partitioned by priority + tenantAdd partitions + brokers
Worker FleetStateless agentsExecute actions (webhook, SQS). Pull-based for backpressureAutoscale on consumer lag
Lease StoreRedisSub-ms distributed locking at 50K ops/secRedis Cluster
Execution LedgerCassandra5.4 TB/month write-heavy historyAdd nodes
DLQKafka topicCapture jobs that exhausted retriesSame Kafka cluster
CoordinationZooKeeperLeader election with fast failover3–5 node ensemble

Recurring job loop: The scheduler pre-computes next_run_at at queue time (before execution happens). So even if the worker crashes, the next occurrence is already set. No orphaned recurring jobs.

Request Flow — Step Through
User / APIJob StoreSchedulerKafkaWorkerRedis LeaseTargetLedger
Click Next Step to walk through the request flow.
06

Deep Dive — Exactly-Once Execution

True exactly-once is impossible in a distributed system — you cannot distinguish "worker crashed after executing but before acknowledging" from "worker crashed before executing." So we build three layers of defense that make duplicates extremely rare and their consequences harmless.

Layer 1 — Distributed Lease

Redis SET NX EX — only one worker holds the lease at a time. Prevents concurrent duplicates. Workers heartbeat to extend the lease. Cost: ~1ms per operation.

Layer 2 — Fencing Tokens

Monotonically increasing counter per execution. Target rejects requests with stale tokens. Prevents sequential duplicates from GC pauses or slow workers. Cost: ~1ms.

Layer 3 — Target Idempotency

Execution carries an idempotency key. Target checks "already processed?" before doing work. Catches everything Layers 1+2 miss. Cost: 2-5ms DB check at target.

The Key Insight

Exactly-once is a property of the entire system (scheduler + target), not just the scheduler alone. The scheduler guarantees at-least-once; the target makes it effectively exactly-once.

sequenceDiagram participant S as Scheduler Leader participant DB as Job Store participant K as Kafka participant W as Worker participant R as Redis participant T as Target Service S->>DB: SELECT due jobs (next_run_at <= NOW+2min) DB-->>S: 5000 jobs S->>DB: UPDATE status=QUEUED, bump version, set next_run_at S->>K: Push jobs with version + fence_token W->>K: Pull batch (consumer group) W->>DB: Check version (stale?) DB-->>W: Version matches ✓ W->>R: SET execution:id NX EX 300 (acquire lease) R-->>W: OK (lease acquired) W->>T: POST webhook (X-Fence-Token, X-Execution-Id) T->>T: Check fence token + idempotency T-->>W: 200 OK W->>R: DEL execution:id (release lease) W->>K: Commit offset

The Slow Worker Problem (Why Leases Alone Aren't Enough)

Worker A acquires a lease, starts executing, then enters a 310-second GC pause. The lease expires at 300 seconds. Worker B acquires the lease and executes the same job. Worker A wakes up and fires the webhook again — the lease didn't help because A didn't know it had expired. Fencing tokens solve this: each lease acquisition increments a counter. The target rejects requests with stale tokens.

The Outbox Pattern — Atomic Pre-Fetch

The scheduler must mark jobs QUEUED in the DB and push to Kafka — two different systems, no shared transaction. If the scheduler crashes between them, jobs get stuck in QUEUED. The fix: write both the status update and an outbox entry in a single DB transaction. A separate outbox relay publishes to Kafka. If it crashes, it replays unpublished entries on restart.

Kafka Offset Commit — The Trickiest Duplicate Source

If a worker executes successfully but crashes before committing the Kafka offset, Kafka redelivers to another worker. The window for duplicates equals the lease timeout. We use at-least-once delivery (commit after executing) because missed jobs are worse than duplicates for a scheduler. Layers 1–3 minimize the duplicate window.

07

Key Design Decisions & Tradeoffs

Pull vs Push to Workers

✓ Chosen

Pull-Based (Kafka Consumers)

Workers consume at their own pace. Backpressure is automatic — busy workers stop pulling. Scaling = add consumers. Kafka handles partition assignment.

✗ Alternative

Push-Based (Scheduler Dispatches)

Lower latency (immediate push), but scheduler must track worker capacity and handle assignment failures. Choose for real-time systems like Uber dispatch.

DB Polling vs Pre-Fetch + Queue

✓ Chosen

Pre-Fetch into Kafka

Scheduler loads jobs due in next 2 minutes in batches. Midnight spike of 4.2M is spread over the pre-fetch window. Workers consume from a buffer.

✗ Alternative

Real-Time DB Polling

Simpler — no queue infrastructure. Works up to ~500K jobs. Breaks when midnight spike returns millions of rows in a single SELECT.

Single Leader vs Partitioned Schedulers

✓ Chosen

Single Leader (ZooKeeper)

One node pre-fetches — no coordination needed, no duplicate pre-fetch risk. Handles our 10M-job scale easily. Hot standbys for HA.

✗ Alternative

Multi-Active Partitioned

Each scheduler owns a shard. Better fault isolation. Needed at 100M+ jobs when one node can't process the pre-fetch volume.

Delivery Guarantee

✓ Chosen

At-Least-Once + Deduplication

Jobs may fire more than once in rare edge cases. Three deduplication layers minimize this. Targets must be idempotent for critical jobs.

✗ Alternative

At-Most-Once

Commit offset before executing — job may be missed if worker crashes. Unacceptable for a scheduler — missed jobs are worse than duplicates.

Cron Evaluation: Lazy vs Eager

✓ Chosen

Lazy (Next Occurrence Only)

Compute only the next next_run_at. Correct across DST transitions — timezone resolved at evaluation time, not pre-computed.

✗ Alternative

Eager (Pre-Compute 30 Days)

Store all future executions. Enables calendar UI but 100× more storage. Pre-computed UTC times break on DST boundaries.

Execution History Store

✓ Chosen

Separate Store (Cassandra)

900M records/day is write-heavy. Time-partitioned storage with cheap retention management. Different consistency requirements than job metadata.

✗ Alternative

Same PostgreSQL

Simple JOINs, one system. But 900M inserts/day + index maintenance + vacuum pressure will crush a single Postgres instance.

08

What Can Go Wrong

Scheduler Leader Dies Mid-Pre-Fetch

Leader marks 2,000 jobs as QUEUED, pushes to Kafka, then crashes before processing the remaining 3,000. The 2,000 already in Kafka execute normally. Any jobs stuck in QUEUED (marked but never pushed) are detected by the recovery sweep on the new leader — it finds QUEUED jobs older than 5 minutes with no execution record and resets them to SCHEDULED. Using the outbox pattern avoids this entirely.

Worker Crashes Mid-Execution

The webhook may or may not have been received by the target. The lease expires after the configured timeout (default 5 minutes). Kafka redelivers the unacknowledged message to another worker. If the first worker's webhook did succeed, the second execution is a duplicate — caught by fencing tokens (Layer 2) or target idempotency (Layer 3).

Thundering Herd at Midnight

4.2M jobs all due at 00:00:00. Mitigated with three layers: pre-fetch spread (start loading at 23:58, not 00:00), execution jitter (0-5s random delay for non-CRITICAL jobs), and per-target rate limiting (cap webhook rate to prevent target service collapse).

Redis Failure (Lease Store Down)

Workers can't acquire leases. We fail open with a circuit breaker — execute without the lease rather than stop all scheduling. Duplicate risk increases temporarily, but Layer 3 (target idempotency) catches it. Availability over consistency on the execution path.

Kafka Cluster Down

Scheduler can't push, workers can't pull — execution stops. The scheduler enters degraded mode: buffers to a local WAL or falls back to direct DB polling. Jobs accumulate and execute when Kafka recovers. Operations team is alerted immediately.

Poison Job (Infinite Crash Loop)

A buggy webhook causes the worker to OOM. Retry counter never increments because the worker dies before reporting. Kafka redelivers endlessly. Fix: track delivery count in Kafka headers. After N deliveries, auto-route to DLQ regardless of retry policy.

Clock Skew Between Nodes

If the scheduler's clock is wrong, jobs fire early or late. Mitigation: NTP everywhere, use database time (NOW()) instead of application time for the pre-fetch query, and monitor clock drift with alerts above 1-second threshold.

Database Failover During Pre-Fetch

Transaction rolls back — no data loss. The scheduler reconnects to the new primary and retries in the next cycle. If using async replication, brief split-brain is possible, but the job version field and lease prevent duplicate execution.

09

Interview Tips

💡
Start with the single-machine baseline.
"One Postgres table, one ticker process, inline execution." Then identify the four breakdowns at scale. Every distributed component exists to solve one of them. This makes your design feel derived, not memorized.
Never claim exactly-once.
Say: "We implement at-least-once with three layers of deduplication that make duplicates effectively impossible." Explain the theoretical limit (you can't distinguish crashed-after-executing from crashed-before). This shows you understand distributed systems fundamentals.
🎯
Use numbers surgically.
Don't front-load 5 minutes of napkin math. Instead: "The midnight spike is 4.2M jobs in one second — that's why I need pre-fetching." The number justifies the decision, not the other way around.
🔑
Draw the state machine early.
SCHEDULED → QUEUED → RUNNING → SUCCEEDED/FAILED → RETRYING → DEAD. Every failure scenario maps to a state transition. It gives the interviewer a shared vocabulary for follow-up questions.
🧠
Mention what most candidates miss.
Timezone DST handling, poison job detection, tenant fairness, the recurring job loop (who computes next_run_at and when), and secret management for webhook auth headers. These signal production experience.
🎯
Every tech choice needs a "because."
Not "I'll use Kafka" but "The midnight spike is 4.2M messages in one second — Kafka handles millions/sec and gives me partitioned consumption and durable replay."
10

Similar Problems

Distributed Task Queue

Same worker fleet + queue pattern. No time dimension — tasks pushed directly by application.

Notification System

Same thundering herd problem amplified. Fan-out from one campaign to millions of recipients.

Rate Limiter

Same time-bucketing pattern. Distributed Redis state. Clock skew sensitivity.

Workflow Orchestrator

Adds DAG dependencies on top of our scheduler. Our system could be the execution engine underneath Airflow/Temporal.

Delayed Message Queue

The inner loop of our scheduler — moving messages from "scheduled" to "ready" based on time. Simplified version of our pre-fetch.

Event-Driven Automation

Our scheduler handles the time-trigger half of Zapier/IFTTT. Event triggers require an event bus — a different subsystem.

11

Evolution

Each stage is triggered by a specific breaking point. Don't build Stage 5 on day one — let the numbers force you.

1

Single-Node Cron Poller — <100K jobs

One Postgres table, one ticker process with a thread pool. No Kafka, no Redis, no ZooKeeper. Simple and correct. Breaks when the process dies (no HA) or slow jobs block the ticker.

2

Separated Workers + Basic HA — 100K–1M jobs

Decouple scheduling from execution with a Redis list queue. Hot standby scheduler via Postgres advisory locks. Workers use SELECT FOR UPDATE SKIP LOCKED for dedup. Breaks when midnight spike query returns 500K rows in 8 seconds.

3

Pre-Fetch + Kafka + Leases — 1M–10M jobs (Target Design)

Pre-fetching replaces real-time polling. Kafka replaces Redis list (priority topics, durability). ZooKeeper replaces advisory locks. Redis leases + fencing tokens. Cassandra for execution history. The design we built in this document.

4

Partitioned Scheduler + Sharded DB — 10M–500M jobs

Shard the job store by tenant_id. Each shard gets its own scheduler leader. Redis Cluster replaces single Redis. Worker fleet autoscales on Kafka consumer lag. Blast radius is per-shard, not global.

5

Multi-Region Global Scale — 500M+ jobs

Region-local scheduling stacks. Global control plane for tenant-to-region routing. Follow-the-sun capacity sharing. Data sovereignty compliance. GPS-synced NTP for cross-region time consistency.

Next up