Exercise · Infrastructure

Distributed Job Scheduler

Whiteboard exercise. Try the problem cold, then reveal the rubric to self-score.

Out of 10 points · 45 min whiteboard

Prompt

How do you reliably execute millions of scheduled tasks at precisely the right time across a fleet of unreliable machines, ensuring no job is missed and no job runs twice?

Time budget: 45 min whiteboard. Draw architecture, estimate numbers, discuss tradeoffs.

Hints (progressive)

Hint 1

Start with requirements: functional vs non-functional. Clarify the scale (number of scheduled jobs, peak executions per second, storage).

Hint 2

Think about the data model first. What entities exist? What are the access patterns?

Hint 3

Identify the hardest sub-problem and deep-dive into it. Show you can go beyond boxes and arrows.

Rubric — 10 points

+2 Back-of-envelope estimation with concrete numbers
+2 Clear API design with key endpoints
+2 Sensible data model and storage choices
+2 Addresses scalability (sharding, caching, CDN)
+2 Discusses failure modes and mitigations

Self-score: tally the points you would have mentioned unprompted. 7+ is interview-ready on this problem.
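
To make the first rubric item concrete, here is a minimal sketch of a back-of-envelope estimate. Every input (10 million jobs per day, a 10x peak factor for jobs that cluster on the hour, 1 KB per job record, 30 days of retention) is an assumption chosen for illustration, not a figure from the exercise; substitute whatever numbers you agree on with the interviewer.

```python
# Back-of-envelope estimate for a job scheduler.
# Every input passed to estimate() below is an assumed figure for illustration.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(jobs_per_day, peak_factor, bytes_per_record, retention_days):
    """Return average and peak execution rates plus retained storage in GB."""
    avg_per_sec = jobs_per_day / SECONDS_PER_DAY
    peak_per_sec = avg_per_sec * peak_factor
    storage_bytes = jobs_per_day * bytes_per_record * retention_days
    return {
        "avg_jobs_per_sec": avg_per_sec,
        "peak_jobs_per_sec": peak_per_sec,
        "storage_gb": storage_bytes / 1e9,
    }


result = estimate(
    jobs_per_day=10_000_000,   # assumed
    peak_factor=10,            # assumed burst at the top of the hour
    bytes_per_record=1_000,    # assumed ~1 KB per job record
    retention_days=30,         # assumed history retention
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions the average rate is roughly 116 executions per second, the peak roughly 1,157 per second, and retained history about 300 GB. The point on the whiteboard is less the exact values than showing that the peak rate, not the average, sizes the worker fleet.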

Red flags (things that tank the interview)

No back-of-envelope estimation: jumps straight into components without quantifying the scale the scheduler must handle
Single point of failure: no replication, failover, or redundancy discussed
Ignores data model and storage choices: hand-waves the database layer
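
One commonly discussed mitigation for the failure-mode and single-point-of-failure items is a lease: a worker claims a due job only if no other worker holds an unexpired lease on it, and a crashed worker's lease simply times out so another worker can take over. The sketch below is one possible illustration, not a reference solution. It uses an in-memory dictionary guarded by a lock as a stand-in for a database with atomic conditional updates, and the names `JobStore`, `try_claim`, `complete`, and `LEASE_SECONDS` are hypothetical.

```python
import threading

LEASE_SECONDS = 30  # assumed lease length; in practice tuned to job duration


class JobStore:
    """In-memory stand-in for a database with atomic conditional updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> dict with run_at, owner, lease_until, done

    def add(self, job_id, run_at):
        with self._lock:
            self._jobs[job_id] = {
                "run_at": run_at, "owner": None, "lease_until": 0.0, "done": False,
            }

    def try_claim(self, job_id, worker, now):
        """Claim a due, unfinished job if no unexpired lease exists."""
        with self._lock:
            job = self._jobs[job_id]
            if job["done"] or now < job["run_at"]:
                return False
            if job["owner"] is not None and now < job["lease_until"]:
                return False  # another worker still holds a valid lease
            job["owner"] = worker
            job["lease_until"] = now + LEASE_SECONDS
            return True

    def complete(self, job_id, worker, now):
        """Mark the job done only if the caller still holds a valid lease."""
        with self._lock:
            job = self._jobs[job_id]
            if job["done"] or job["owner"] != worker or now >= job["lease_until"]:
                return False
            job["done"] = True
            return True


store = JobStore()
store.add("report-1", run_at=100.0)
print(store.try_claim("report-1", "worker-a", now=99.0))   # False: not due yet
print(store.try_claim("report-1", "worker-a", now=100.0))  # True: worker-a holds the lease
print(store.try_claim("report-1", "worker-b", now=110.0))  # False: lease still valid
print(store.try_claim("report-1", "worker-b", now=131.0))  # True: lease expired, takeover
print(store.complete("report-1", "worker-a", now=132.0))   # False: stale owner rejected
print(store.complete("report-1", "worker-b", now=132.0))   # True: current owner completes
```

Note that a lease alone gives at-least-once execution: in the walkthrough above, worker-a may still be running the job when worker-b takes over. Getting closer to "no job runs twice" typically also requires idempotent jobs or a fencing check on side effects, which is a good tradeoff to raise on the whiteboard.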