Problem Statement
Think of cron on Linux: you write a crontab entry, the daemon wakes up every minute, checks whether anything is due, and runs it. That works perfectly on one machine. Now imagine you need to send 50M promotional notifications at exactly 9:00 AM, expire 200K flash-deal prices at noon, and generate settlement reports for 100K sellers every night. One cron box can't handle that load. If it dies, everything stops. If you add a second box, the same job runs twice.
That's the problem: cron, but distributed, reliable, and scalable. We're building the engine behind systems like Google Cloud Scheduler, Amazon EventBridge Scheduler, or internal job scheduling infrastructure at companies like Uber and Amazon.
Core question: How do you guarantee every scheduled job fires exactly once at the right time, across machines that can crash at any moment?
The Core Tensions
Precision vs Scale
You want every job to fire at exactly its scheduled time, T+0. But if 4.2M jobs are all due at midnight, they can't all fire in the same instant. How much jitter is acceptable?
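To make the jitter question concrete, here is a rough back-of-envelope sketch. It assumes the fleet dispatches jobs at a constant rate, so the last job in a burst fires roughly burst size divided by rate seconds late. The dispatch rates are illustrative assumptions, not measurements from any real system.

```python
def worst_case_delay_seconds(burst_size: int, dispatch_rate_per_sec: int) -> float:
    """Seconds until the last job in a burst is dispatched, assuming a constant rate."""
    if dispatch_rate_per_sec <= 0:
        raise ValueError("dispatch rate must be positive")
    return burst_size / dispatch_rate_per_sec


BURST = 4_200_000  # jobs due at midnight, the figure used in the text above

# Hypothetical fleet-wide dispatch rates in jobs per second.
for rate in (10_000, 50_000, 100_000):
    delay = worst_case_delay_seconds(BURST, rate)
    print(f"{rate:>7} jobs/s: last job fires about {delay:.0f}s after midnight")
```

Under these assumed rates, even 100,000 dispatches per second leaves the last job 42 seconds late, which is why "acceptable jitter" has to be a stated requirement rather than an afterthought.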
Reliability vs Simplicity
Multiple scheduler nodes means coordination to avoid duplicate execution. Coordination means complexity and new failure modes.
Exactly-Once vs Performance
Taking a distributed lock on every job pickup is expensive. At-least-once delivery with idempotent jobs is cheaper, but it pushes complexity onto job authors.
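What "idempotent" asks of a job author can be sketched roughly as follows. The class name, the key format, and the in-memory set are all assumptions for illustration; a real system would keep the completed keys in a durable store, and this single-threaded sketch ignores concurrent deliveries.

```python
from typing import Callable


class IdempotentRunner:
    """Skips a job whose (job_id, scheduled_for) key has already completed.

    Hypothetical in-memory sketch: completed keys are lost on restart.
    """

    def __init__(self) -> None:
        self._done = set()  # set of (job_id, scheduled_for) tuples

    def run(self, job_id: str, scheduled_for: str, action: Callable[[], None]) -> bool:
        key = (job_id, scheduled_for)
        if key in self._done:
            return False  # duplicate delivery: skip the side effect
        action()
        # Recorded only after success, so a failed attempt can be retried.
        self._done.add(key)
        return True


sent = []
runner = IdempotentRunner()
first = runner.run("promo-42", "2024-01-01T09:00:00Z", lambda: sent.append("promo-42"))
# The scheduler redelivers the same firing, as at-least-once allows.
second = runner.run("promo-42", "2024-01-01T09:00:00Z", lambda: sent.append("promo-42"))
print(first, second, sent)
```

The key includes the scheduled time so that the next firing of a recurring job is treated as new work rather than as a duplicate.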
Flexibility vs Predictability
Supporting priorities, retries, cron expressions, one-time and recurring jobs means a complex state machine with more edge cases.
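As a minimal sketch of why the state machine grows, here is one possible set of states and allowed transitions. The state names and the transition table are assumptions made for illustration, not a prescribed design; each feature (retries, recurrence) adds an edge that must be handled correctly.

```python
from enum import Enum


class JobState(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"  # one attempt failed; may still be retried
    DEAD = "dead"      # retries exhausted


# Hypothetical transition table.
ALLOWED = {
    JobState.SCHEDULED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED},
    JobState.FAILED: {JobState.SCHEDULED, JobState.DEAD},  # retry or give up
    JobState.SUCCEEDED: {JobState.SCHEDULED},  # recurring jobs re-arm
    JobState.DEAD: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


state = JobState.SCHEDULED
state = transition(state, JobState.RUNNING)
state = transition(state, JobState.FAILED)
state = transition(state, JobState.SCHEDULED)  # retry
print(state.name)
```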
Single-Machine Baseline
The simplest version: a jobs table in PostgreSQL, a ticker process that queries WHERE next_run_at <= NOW() every second, and inline execution of whatever it finds. This breaks in four ways at scale: the ticker dies (no jobs run), slow jobs block fast ones (a backlog builds), millions of jobs come due at midnight (the query chokes), and a second ticker is added (duplicate execution). Every distributed component exists to solve one of these four breakdowns.