Distributed Job Scheduler

01

Problem Statement

Think of cron on Linux — you write a crontab entry, the daemon wakes up every minute, checks if anything is due, and runs it. Works perfectly on one machine. Now imagine you need to send 50M promotional notifications at exactly 9:00 AM, expire 200K flash-deal prices at noon, and generate settlement reports for 100K sellers every night. One cron box can't handle this. If it dies, everything stops. If you add a second box, the same job runs twice.

That's the problem: cron, but distributed, reliable, and scalable. We're building the engine behind systems like Google Cloud Scheduler, AWS EventBridge Scheduler, or internal job scheduling infrastructure at companies like Uber and Amazon.

Core question: How do you guarantee every scheduled job fires exactly once at the right time, across machines that can crash at any moment?

The Core Tensions

Precision vs Scale

You want jobs at exactly T+0. But 4.2M jobs at midnight can't all fire at T+0. How much jitter is acceptable?

Reliability vs Simplicity

Multiple scheduler nodes means coordination to avoid duplicate execution. Coordination means complexity and new failure modes.

Exactly-Once vs Performance

Distributed locking on every job pickup is expensive. At-least-once with idempotent jobs pushes complexity to job authors.

Flexibility vs Predictability

Supporting priorities, retries, cron expressions, one-time and recurring jobs means a complex state machine with more edge cases.

Single-Machine Baseline

The simplest version: a jobs table in PostgreSQL, a ticker process that runs every second querying WHERE next_run_at <= NOW(), and inline execution. This breaks in four ways at scale: ticker dies (no jobs run), slow jobs block fast ones (backlog), millions due at midnight (query chokes), and two tickers (duplicate execution). Every distributed component exists to solve one of these four breakdowns.

02

Requirements

Functional Requirements

Create jobs — one-time ("run at time T") and recurring ("cron expression + timezone"). The scheduler is a trigger mechanism, not a compute platform — it fires webhooks or queue messages, not arbitrary code.
CRUD + Pause/Resume — update schedule, change priority, pause recurring jobs (resume from "now", not catchup), soft-delete with audit trail.
Execution history — last N runs per job with status, duration, error, retry count. Essential for debugging and trust.
Retry on failure — configurable max retries, backoff strategy (fixed, exponential, exponential with jitter). System detects worker crashes via lease expiry.
Priority support — CRITICAL > HIGH > MEDIUM > LOW. Payment reconciliation must not wait behind analytics reports.
Completion notifications — webhook callback on success, failure, or DLQ entry.

Non-Functional Requirements

Effectively exactly-once execution — at-least-once triggering + three layers of deduplication (lease, fencing, idempotency). The rabbit hole for this problem.
Low trigger jitter < 5 seconds — a job scheduled for 09:00:00 should fire by 09:00:05. This constrains polling frequency and pushes toward pre-fetching.
High availability — survives single node crash, DB failover, network partition. Missed jobs during outage execute on recovery (configurable catchup window).
Scale to 10M+ registered jobs, 50K triggers/sec at peak — rules out single-node designs. Job store, tick mechanism, and worker fleet all scale horizontally.
Durability — job definitions and execution history survive crashes. In-memory-only schedulers are unacceptable.
Multi-tenancy and fairness — tenant A's 5M midnight jobs must not starve tenant B's 10 critical jobs.

The impossible triangle: Exactly-once + High availability + Low jitter. Every design decision is picking a side of this triangle. We prioritize availability (never miss) and low jitter, then layer defenses for exactly-once.

Explicit Non-Goals

Not a workflow/DAG engine — no job dependencies (Job B after Job A). That's Airflow/Temporal territory.
Not a compute platform — we trigger jobs, we don't run arbitrary code. Workers are external.
Not sub-second precision — second-level granularity is our floor.

03

Scale Estimation

We're designing for a platform-scale scheduler — 10M registered jobs as the primary target, with a path to 100M.

~50K/s

Peak triggers per second

4.2M

Midnight spike (one second)

21 GB

Job metadata (fits single DB)

5.4 TB

Execution history per month

1.5M

In-flight jobs at peak

100 MB/s

Dispatch bandwidth

Derivation

10M jobs: 70% recurring (mix of 1-min to weekly), 30% one-time. Recurring triggers sum to ~900M/day ≈ 10K/sec average. Peak-to-average ratio of 5× gives 50K triggers/sec at peak. The thundering herd at midnight: 30% hourly + 30% daily jobs all due at 00:00:00 = 4.2M jobs in one second. This single number forces pre-fetching — you cannot query 4.2M rows from Postgres in real-time. Job metadata at 2KB each = 21 GB (fits Postgres). Execution history at 200 bytes × 900M/day × 30 days = 5.4 TB/month — needs a separate store.

Key insight: Storage is not the bottleneck — the query pattern is. "Give me all jobs where next_run_at <= NOW()" hits the same index every second, and at top-of-hour the result set explodes. This pushes us from "poll the DB" to "pre-load into a queue."

04

API Design

Two entities: Job (the definition — long-lived) and Execution (a single run — short-lived). Recurring jobs stay SCHEDULED and spawn execution instances. All write operations require an Idempotency-Key header.

Job Lifecycle State Machine

        SCHEDULED → QUEUED → RUNNING → SUCCEEDED
                        ↓
                     FAILED → RETRYING → QUEUED
                        ↓
                  DEAD (max retries exhausted)

PAUSED  (out of scheduling loop, resume → SCHEDULED)
CANCELLED (terminal, no further executions)
      

Create a Recurring Job

        POST /api/v1/jobs
Headers: Idempotency-Key: "client-uuid-123"

{
  "name": "nightly-seller-report",
  "type": "RECURRING",
  "schedule": {
    "cron": "0 2 * * *",
    "timezone": "Asia/Kolkata",
    "starts_at": "2026-04-07T00:00:00Z"
  },
  "action": {
    "type": "WEBHOOK",
    "url": "https://seller-service.internal/reports/generate",
    "method": "POST",
    "headers": { "Authorization": "Bearer {{secret_ref}}" },
    "body": { "report_type": "settlement" }
  },
  "retry_policy": {
    "max_retries": 3,
    "backoff": "EXPONENTIAL_WITH_JITTER",
    "initial_delay_seconds": 30
  },
  "priority": "HIGH",
  "tenant_id": "seller-platform"
}

→ 201 Created
{ "job_id": "job_8xk2m9f3", "status": "SCHEDULED",
  "next_run_at": "2026-04-08T02:00:00+05:30" }
      

Timezone Handling

Store cron + timezone, compute next_run_at in UTC at evaluation time — not at creation. Pre-computing UTC times for future dates gets DST transitions wrong.

Version-Based Invalidation

Every job has a version field, bumped on update. Queued executions carry the version at queue time. Workers check: if stale, discard.

Worker Internal API — Pull-Based

        POST /internal/v1/executions/claim
{ "worker_id": "worker-us-east-42", "capacity": 10 }

POST /internal/v1/executions/heartbeat
{ "worker_id": "worker-us-east-42",
  "execution_ids": ["exec_1", "exec_2"] }

POST /internal/v1/executions/{id}/complete
{ "status": "SUCCEEDED", "duration_ms": 34200 }
      

05

High-Level Architecture

Every component exists because a specific number or failure mode forced it. The single-machine baseline breaks in four ways — each component solves one.

Component	Tech	Why	Scaling
API Service	Stateless app	User-facing CRUD, decoupled from scheduling path	Horizontal, behind LB
Job Store	PostgreSQL	21 GB fits one DB, strong consistency for state transitions	Primary-replica
Scheduler	Leader-elected	Pre-fetches due jobs every 30s, prevents duplicate pre-fetch	Single leader + standbys
Dispatch Queue	Kafka	50K/sec, durable, partitioned by priority + tenant	Add partitions + brokers
Worker Fleet	Stateless agents	Execute actions (webhook, SQS). Pull-based for backpressure	Autoscale on consumer lag
Lease Store	Redis	Sub-ms distributed locking at 50K ops/sec	Redis Cluster
Execution Ledger	Cassandra	5.4 TB/month write-heavy history	Add nodes
DLQ	Kafka topic	Capture jobs that exhausted retries	Same Kafka cluster
Coordination	ZooKeeper	Leader election with fast failover	3–5 node ensemble

Recurring job loop: The scheduler pre-computes next_run_at at queue time (before execution happens). So even if the worker crashes, the next occurrence is already set. No orphaned recurring jobs.

Request Flow — Step Through

User / API→Job Store→Scheduler→Kafka→Worker→Redis Lease→Target→Ledger

Click Next Step to walk through the request flow.

06

Deep Dive — Exactly-Once Execution

True exactly-once is impossible in a distributed system — you cannot distinguish "worker crashed after executing but before acknowledging" from "worker crashed before executing." So we build three layers of defense that make duplicates extremely rare and their consequences harmless.

Layer 1 — Distributed Lease

Redis SET NX EX — only one worker holds the lease at a time. Prevents concurrent duplicates. Workers heartbeat to extend the lease. Cost: ~1ms per operation.

Layer 2 — Fencing Tokens

Monotonically increasing counter per execution. Target rejects requests with stale tokens. Prevents sequential duplicates from GC pauses or slow workers. Cost: ~1ms.

Layer 3 — Target Idempotency

Execution carries an idempotency key. Target checks "already processed?" before doing work. Catches everything Layers 1+2 miss. Cost: 2-5ms DB check at target.

The Key Insight

Exactly-once is a property of the entire system (scheduler + target), not just the scheduler alone. The scheduler guarantees at-least-once; the target makes it effectively exactly-once.

sequenceDiagram participant S as Scheduler Leader participant DB as Job Store participant K as Kafka participant W as Worker participant R as Redis participant T as Target Service S->>DB: SELECT due jobs (next_run_at <= NOW+2min) DB-->>S: 5000 jobs S->>DB: UPDATE status=QUEUED, bump version, set next_run_at S->>K: Push jobs with version + fence_token W->>K: Pull batch (consumer group) W->>DB: Check version (stale?) DB-->>W: Version matches ✓ W->>R: SET execution:id NX EX 300 (acquire lease) R-->>W: OK (lease acquired) W->>T: POST webhook (X-Fence-Token, X-Execution-Id) T->>T: Check fence token + idempotency T-->>W: 200 OK W->>R: DEL execution:id (release lease) W->>K: Commit offset

The Slow Worker Problem (Why Leases Alone Aren't Enough)

Worker A acquires a lease, starts executing, then enters a 310-second GC pause. The lease expires at 300 seconds. Worker B acquires the lease and executes the same job. Worker A wakes up and fires the webhook again — the lease didn't help because A didn't know it had expired. Fencing tokens solve this: each lease acquisition increments a counter. The target rejects requests with stale tokens.

The Outbox Pattern — Atomic Pre-Fetch

The scheduler must mark jobs QUEUED in the DB and push to Kafka — two different systems, no shared transaction. If the scheduler crashes between them, jobs get stuck in QUEUED. The fix: write both the status update and an outbox entry in a single DB transaction. A separate outbox relay publishes to Kafka. If it crashes, it replays unpublished entries on restart.

Kafka Offset Commit — The Trickiest Duplicate Source

If a worker executes successfully but crashes before committing the Kafka offset, Kafka redelivers to another worker. The window for duplicates equals the lease timeout. We use at-least-once delivery (commit after executing) because missed jobs are worse than duplicates for a scheduler. Layers 1–3 minimize the duplicate window.

07

Simple JOINs, one system. But 900M inserts/day + index maintenance + vacuum pressure will crush a single Postgres instance.

08

What Can Go Wrong

Scheduler Leader Dies Mid-Pre-Fetch

Leader marks 2,000 jobs as QUEUED, pushes to Kafka, then crashes before processing the remaining 3,000. The 2,000 already in Kafka execute normally. Any jobs stuck in QUEUED (marked but never pushed) are detected by the recovery sweep on the new leader — it finds QUEUED jobs older than 5 minutes with no execution record and resets them to SCHEDULED. Using the outbox pattern avoids this entirely.

Worker Crashes Mid-Execution

The webhook may or may not have been received by the target. The lease expires after the configured timeout (default 5 minutes). Kafka redelivers the unacknowledged message to another worker. If the first worker's webhook did succeed, the second execution is a duplicate — caught by fencing tokens (Layer 2) or target idempotency (Layer 3).

Thundering Herd at Midnight

4.2M jobs all due at 00:00:00. Mitigated with three layers: pre-fetch spread (start loading at 23:58, not 00:00), execution jitter (0-5s random delay for non-CRITICAL jobs), and per-target rate limiting (cap webhook rate to prevent target service collapse).

Redis Failure (Lease Store Down)

Workers can't acquire leases. We fail open with a circuit breaker — execute without the lease rather than stop all scheduling. Duplicate risk increases temporarily, but Layer 3 (target idempotency) catches it. Availability over consistency on the execution path.

Kafka Cluster Down

Scheduler can't push, workers can't pull — execution stops. The scheduler enters degraded mode: buffers to a local WAL or falls back to direct DB polling. Jobs accumulate and execute when Kafka recovers. Operations team is alerted immediately.

Poison Job (Infinite Crash Loop)

A buggy webhook causes the worker to OOM. Retry counter never increments because the worker dies before reporting. Kafka redelivers endlessly. Fix: track delivery count in Kafka headers. After N deliveries, auto-route to DLQ regardless of retry policy.

Clock Skew Between Nodes

If the scheduler's clock is wrong, jobs fire early or late. Mitigation: NTP everywhere, use database time (NOW()) instead of application time for the pre-fetch query, and monitor clock drift with alerts above 1-second threshold.

Database Failover During Pre-Fetch

Transaction rolls back — no data loss. The scheduler reconnects to the new primary and retries in the next cycle. If using async replication, brief split-brain is possible, but the job version field and lease prevent duplicate execution.

09

Interview Tips

💡

Start with the single-machine baseline.
"One Postgres table, one ticker process, inline execution." Then identify the four breakdowns at scale. Every distributed component exists to solve one of them. This makes your design feel derived, not memorized.

⚡

Never claim exactly-once.
Say: "We implement at-least-once with three layers of deduplication that make duplicates effectively impossible." Explain the theoretical limit (you can't distinguish crashed-after-executing from crashed-before). This shows you understand distributed systems fundamentals.

🎯

Use numbers surgically.
Don't front-load 5 minutes of napkin math. Instead: "The midnight spike is 4.2M jobs in one second — that's why I need pre-fetching." The number justifies the decision, not the other way around.

🔑

Draw the state machine early.
SCHEDULED → QUEUED → RUNNING → SUCCEEDED/FAILED → RETRYING → DEAD. Every failure scenario maps to a state transition. It gives the interviewer a shared vocabulary for follow-up questions.

🧠

Mention what most candidates miss.
Timezone DST handling, poison job detection, tenant fairness, the recurring job loop (who computes next_run_at and when), and secret management for webhook auth headers. These signal production experience.

🎯

Every tech choice needs a "because."
Not "I'll use Kafka" but "The midnight spike is 4.2M messages in one second — Kafka handles millions/sec and gives me partitioned consumption and durable replay."

10

Evolution

Each stage is triggered by a specific breaking point. Don't build Stage 5 on day one — let the numbers force you.

1

Single-Node Cron Poller — <100K jobs

One Postgres table, one ticker process with a thread pool. No Kafka, no Redis, no ZooKeeper. Simple and correct. Breaks when the process dies (no HA) or slow jobs block the ticker.

2

Separated Workers + Basic HA — 100K–1M jobs

Decouple scheduling from execution with a Redis list queue. Hot standby scheduler via Postgres advisory locks. Workers use SELECT FOR UPDATE SKIP LOCKED for dedup. Breaks when midnight spike query returns 500K rows in 8 seconds.

3

Pre-Fetch + Kafka + Leases — 1M–10M jobs (Target Design)

Pre-fetching replaces real-time polling. Kafka replaces Redis list (priority topics, durability). ZooKeeper replaces advisory locks. Redis leases + fencing tokens. Cassandra for execution history. The design we built in this document.

4

Partitioned Scheduler + Sharded DB — 10M–500M jobs

Shard the job store by tenant_id. Each shard gets its own scheduler leader. Redis Cluster replaces single Redis. Worker fleet autoscales on Kafka consumer lag. Blast radius is per-shard, not global.

5

Multi-Region Global Scale — 500M+ jobs

Region-local scheduling stacks. Global control plane for tenant-to-region routing. Follow-the-sun capacity sharing. Data sovereignty compliance. GPS-synced NTP for cross-region time consistency.

📺

References & Videos

Distributed Job Scheduler

Arpit Bhayani · 25 min

Task Scheduling at Scale

ByteByteGo · 10 min

Design a Task Scheduler

AlgoMaster

Async Task Scheduling at Dropbox

Dropbox Tech

Problem Statement

The Core Tensions

Precision vs Scale

Reliability vs Simplicity

Exactly-Once vs Performance

Flexibility vs Predictability

Single-Machine Baseline

Requirements

Functional Requirements

Non-Functional Requirements

Explicit Non-Goals

Scale Estimation

Derivation

API Design

Job Lifecycle State Machine

Timezone Handling

Version-Based Invalidation

High-Level Architecture

Deep Dive — Exactly-Once Execution

Layer 1 — Distributed Lease

Layer 2 — Fencing Tokens

Layer 3 — Target Idempotency

The Key Insight

The Slow Worker Problem (Why Leases Alone Aren't Enough)

The Outbox Pattern — Atomic Pre-Fetch

Kafka Offset Commit — The Trickiest Duplicate Source

Key Design Decisions & Tradeoffs

Pull vs Push to Workers

Pull-Based (Kafka Consumers)

Push-Based (Scheduler Dispatches)

DB Polling vs Pre-Fetch + Queue

Pre-Fetch into Kafka

Real-Time DB Polling

Single Leader vs Partitioned Schedulers

Single Leader (ZooKeeper)

Multi-Active Partitioned

Delivery Guarantee

At-Least-Once + Deduplication

At-Most-Once

Cron Evaluation: Lazy vs Eager

Lazy (Next Occurrence Only)

Eager (Pre-Compute 30 Days)

Execution History Store

Separate Store (Cassandra)

Same PostgreSQL

What Can Go Wrong

Scheduler Leader Dies Mid-Pre-Fetch

Worker Crashes Mid-Execution

Thundering Herd at Midnight

Redis Failure (Lease Store Down)

Kafka Cluster Down

Poison Job (Infinite Crash Loop)

Clock Skew Between Nodes

Database Failover During Pre-Fetch

Interview Tips

Similar Problems

Distributed Task Queue

Notification System

Rate Limiter

Workflow Orchestrator

Delayed Message Queue

Event-Driven Automation

Evolution

Single-Node Cron Poller — <100K jobs

Separated Workers + Basic HA — 100K–1M jobs

Pre-Fetch + Kafka + Leases — 1M–10M jobs (Target Design)

Partitioned Scheduler + Sharded DB — 10M–500M jobs

Multi-Region Global Scale — 500M+ jobs

References & Videos

Distributed Task Queue

Notification System

Message Queue vs Pub/Sub