LeetCode

Users submit code solutions in 15+ languages. The system compiles, runs against hidden test cases with strict time/memory limits, and returns a verdict — all within seconds. The hard parts: sandboxed execution that prevents user code from escaping or consuming unbounded resources, a judge queue that handles 10K concurrent contest submissions without starving practice users, and plagiarism detection across millions of historical submissions. LeetCode processes ~10M submissions/day.

⚡ Core: Sandbox + Judge Queue + Verdicts · 10M submissions/day · 15+ languages · Strict time/mem limits · Live contests

Requirements

Functional

Submit code in Python, Java, C++, Go, JS, Rust, etc.
Compile (if compiled lang) → run against hidden test cases → return verdict: Accepted (AC), Wrong Answer (WA), Compile Error (CE), Time Limit Exceeded (TLE), Memory Limit Exceeded (MLE), or Runtime Error (RE)
Enforce per-problem time limit (e.g., 2 sec) and memory limit (e.g., 256 MB)
Live contests with thousands of concurrent submitters; contest leaderboard ranked by problems solved, then by total solve time including penalties
Practice mode: submit anytime, see pass/fail per test case
Run history, editorial solutions, discussion forum
Plagiarism detection across submissions for the same problem

Non-Functional

Verdict returned in < 10 seconds for practice; < 30 s for large test suites
Sandbox isolation — user code cannot read other submissions, access network, or escape container
Scale to 10K concurrent judge jobs during peak contest
Consistent verdicts — same code always produces same result (deterministic execution)
Fair during contests — all submissions judged with equal resources (no advantage from lucky node placement)
99.9% availability — downtime during a contest is catastrophic

Scale Estimation

Submissions / day: ~10M (~115/sec avg; 10K+/sec burst during contest start)

Avg execution time: ~2 sec (compile + run all test cases; varies by language)

Test cases / problem: ~50–200 (hidden; streamed to judge; not loaded all at once)

Judge workers needed (peak): ~5K (10K submissions × 2 sec each / 4 workers/machine)

Languages supported: 15+ (each needs its own compiler/runtime in the sandbox image)

Submission storage: ~10 TB / year (10M/day × ~3 KB code avg; test cases separate)
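
The estimates above can be checked with a few lines of arithmetic. This is a sketch under one assumption the table does not state: the "10K submissions" in the worker formula is read as a burst arrival rate per second, which makes the ~5K result a machine count rather than a count of individual workers. All constant names are illustrative.

```python
SUBMISSIONS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60
AVG_CODE_BYTES = 3_000             # ~3 KB of source per submission
PEAK_SUBMISSIONS_PER_SEC = 10_000  # assumption: burst arrival rate at contest start
AVG_JUDGE_SECONDS = 2              # compile + run all test cases
WORKERS_PER_MACHINE = 4

# Average arrival rate over a day.
avg_rate = SUBMISSIONS_PER_DAY / SECONDS_PER_DAY

# Little's law: jobs in flight = arrival rate x time each job spends in the system.
peak_in_flight = PEAK_SUBMISSIONS_PER_SEC * AVG_JUDGE_SECONDS
peak_machines = peak_in_flight / WORKERS_PER_MACHINE

# Source code only; test cases are stored separately.
storage_tb_per_year = SUBMISSIONS_PER_DAY * AVG_CODE_BYTES * 365 / 1e12

print(f"{avg_rate:.0f} submissions/sec on average")
print(f"{peak_in_flight} jobs in flight at peak, {peak_machines:.0f} machines")
print(f"{storage_tb_per_year:.1f} TB of submissions per year")
```

If the peak is instead read as 10K jobs in flight, the same formula gives 2,500 machines, so it is worth stating which reading is meant.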

API Design

POST /api/submissions

Submit code. Body: {problem_id, language, code, contest_id?}. Returns {submission_id, status: "queued"}. Client polls or subscribes via WebSocket for verdict.

GET /api/submissions/{id}

Get submission status + verdict. Returns {status, verdict, runtime_ms, memory_kb, test_cases_passed, total_test_cases, error_output?}.

POST /api/run

"Run code" (practice mode). Like submit but only runs against sample test cases (not hidden). Returns output. No leaderboard impact.

GET /api/contests/{id}/leaderboard?page=1

Contest leaderboard. Ranked by (problems_solved DESC, total_time ASC). Paginated. Updated near-real-time as verdicts land.

GET /api/problems/{id}/submissions?user_id=X

User's submission history for a problem. Returns list of {submission_id, verdict, language, runtime_ms, submitted_at}.

Architecture

Three tiers: submission API (accept + queue), judge farm (sandboxed execution), and result store (verdicts + leaderboard). The judge farm is the interesting part — each worker is a short-lived sandbox that compiles, runs, and compares output, then dies.

Figure: Judge Pipeline Architecture (diagram not reproduced)

Request Flow

User · submits code → Submission svc · persist + enqueue → Judge Queue · priority lanes → Worker (gVisor) · compile + run → Test cases · stream from S3 → Verdict svc · compare output → Leaderboard · sorted set update

Deep Dive — Sandboxed Execution + Judge Queue

The fundamental tension: you're running arbitrary untrusted code on your servers. User code can fork-bomb, allocate terabytes, read /etc/passwd, open sockets, or try to escape to the host. The sandbox must make all of these impossible while still allowing normal computation.

Sandbox options (from strongest to lightest):

VM per submission (Firecracker). Full kernel isolation. ~125 ms boot. Strongest. AWS Lambda uses this. Expensive at 10K concurrent.
gVisor (Google's user-space kernel). Intercepts syscalls and reimplements them in user space, so the container runs normally but its kernel calls are served by gVisor instead of the host kernel. Moderate overhead (~10% slower; syscall-heavy code tends to pay more), excellent isolation. A common choice for LeetCode-class judges.
Container with seccomp + namespaces. Standard Linux containers with a restricted syscall set. Lighter, but the host kernel is shared, so a kernel exploit can escape. Fine for trusted code; risky for arbitrary code.

Judge pipeline, step by step:

Submission lands in queue. Two priority lanes: contest (higher) and practice (lower). During a contest, practice submissions are throttled but not dropped.
Worker picks job. Spins up a gVisor sandbox with the language runtime (pre-warmed container pool for Python, Java, C++). Pulls user code from S3.
Compile (if needed). C++/Java/Rust compiled inside sandbox. Time limit on compilation (e.g., 30 s). Compile error → verdict = CE.
Run against test cases, one by one. Test cases streamed from S3 (not loaded all at once — some problems have 200+ cases, each with large input). For each: set stdin, run, capture stdout, compare against expected. On first Wrong Answer in contest mode → stop (fail-fast). In practice mode → run all for feedback.
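Resource enforcement. cgroups enforce the memory limit (the process is killed if it exceeds it, which yields MLE). A wall-clock timer enforces the time limit: a run that takes longer than TL is a TLE, and the process gets SIGKILL at TL × 1.5 as a hard backstop. No network access (sandbox drops all outbound traffic).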
Verdict emitted. Worker writes {submission_id, verdict, runtime_ms, memory_kb, test_cases_passed} to verdict service. Worker sandbox destroyed. Leaderboard updated if contest.
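
The run-and-compare loop in the steps above can be sketched with the standard library. This is a simplified illustration, not a production judge: it runs the program as a plain subprocess with no sandbox, enforces only the wall-clock limit (memory limits via cgroups are out of scope here), and the `judge` function and its parameters are hypothetical names.

```python
import subprocess
import sys
import time


def judge(cmd, cases, time_limit_s, fail_fast):
    """Run cmd once per (stdin, expected) pair; return (verdict, cases_passed)."""
    verdict, passed = "AC", 0
    for stdin_text, expected in cases:
        start = time.monotonic()
        try:
            proc = subprocess.run(
                cmd, input=stdin_text, capture_output=True, text=True,
                timeout=time_limit_s * 1.5,  # hard kill deadline (TL x 1.5)
            )
        except subprocess.TimeoutExpired:
            result = "TLE"  # subprocess.run has already killed the child
        else:
            elapsed = time.monotonic() - start
            if elapsed > time_limit_s:
                result = "TLE"  # finished, but slower than the limit
            elif proc.returncode != 0:
                result = "RE"
            elif proc.stdout.split() == expected.split():
                result = "AC"
            else:
                result = "WA"
        if result == "AC":
            passed += 1
            continue
        if verdict == "AC":
            verdict = result  # report the first failure
        if fail_fast:
            break  # contest mode: stop at the first failing case
    return verdict, passed


# A stand-in "submission" that doubles its input, and three test cases.
DOUBLER = [sys.executable, "-c", "print(int(input()) * 2)"]
CASES = [("2\n", "4\n"), ("3\n", "7\n"), ("5\n", "10\n")]
```

Calling `judge(DOUBLER, CASES, 5.0, fail_fast=True)` stops at the second case, while `fail_fast=False` runs all three and reports how many passed, which matches the contest and practice behaviours described above. `cases` can be any iterable, so it works with a generator that streams cases one at a time.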

Judge Execution Sequence (Mermaid)

sequenceDiagram
    participant U as User
    participant S as Submission svc
    participant Q as Judge Queue
    participant W as Worker (gVisor)
    participant TC as S3 (test cases)
    participant V as Verdict svc
    participant L as Leaderboard
    U->>S: POST /submissions {code}
    S->>S: persist to Postgres + S3
    S->>Q: enqueue (priority=contest)
    S-->>U: submission_id, status=queued
    Q->>W: dequeue job
    W->>W: spin gVisor sandbox
    W->>W: compile (C++ → binary)
    loop each test case
        W->>TC: stream test input
        W->>W: run with time/mem limits
        W->>W: compare output
        Note over W: Wrong Answer → stop (contest)
    end
    W->>V: verdict {AC, 45ms, 12MB}
    V->>L: update leaderboard
    V-->>U: push verdict via WS

Deterministic execution. Same code must produce the same verdict every time. Threats: random() in user code (seed it), floating-point non-determinism (accept epsilon), time-dependent code (mock system clock). Workers run on identical hardware specs; container images are pinned versions.
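
Accepting a floating-point epsilon comes down to a tolerant output comparison. The sketch below is one possible checker, with assumed tolerances (`rel_tol`, `abs_tol`) and an assumed policy that integer answers must still match exactly; real problems usually state their own tolerance.

```python
import math


def _is_int(token):
    try:
        int(token)
        return True
    except ValueError:
        return False


def outputs_match(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Compare whitespace-separated tokens; floats may differ within a tolerance."""
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        if a == e:
            continue
        if _is_int(e):
            return False  # integer answers must match exactly
        try:
            if not math.isclose(float(a), float(e), rel_tol=rel_tol, abs_tol=abs_tol):
                return False
        except ValueError:
            return False  # non-numeric tokens must match exactly
    return True
```

Keeping integers exact matters because large integers lose precision when converted to floats, so a tolerance check alone could accept a wrong answer.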

Plagiarism detection. Offline batch job after each contest. Uses MOSS-style (Measure of Software Similarity) fingerprinting: normalize code (strip whitespace, rename variables), extract k-gram hashes, compare Jaccard similarity. Flag pairs above threshold for manual review. AST-based comparison catches deeper structural plagiarism than text-based.
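
The normalize, fingerprint, and compare steps can be sketched as follows. This is a simplified illustration rather than MOSS itself: MOSS additionally uses winnowing to keep only a subset of the k-gram hashes, whereas this sketch keeps them all. The keyword set, the regex tokenizer, and `k=5` are assumptions chosen for the example.

```python
import re
import zlib

# Illustrative keyword subset; a real system would use per-language keyword lists.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}


def normalize(source):
    """Tokenize, drop whitespace, and collapse every identifier to ID."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    out = []
    for tok in tokens:
        is_word = tok[0].isalpha() or tok[0] == "_"
        out.append("ID" if is_word and tok not in KEYWORDS else tok)
    return out


def fingerprints(source, k=5):
    """Set of hashes over every window of k consecutive normalized tokens."""
    toks = normalize(source)
    return {zlib.crc32(" ".join(toks[i:i + k]).encode())
            for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed = "def sum_all(values):\n  acc = 0\n  for v in values:\n    acc += v\n  return acc\n"
unrelated = "def fact(n):\n    return 1 if n < 2 else n * fact(n - 1)\n"
```

Renaming variables and changing indentation leaves the fingerprint set unchanged, so `original` and `renamed` score as identical while `unrelated` shares only a couple of k-grams from the function header.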

Interview answer

"Submissions enqueue with contest > practice priority. Workers are gVisor-sandboxed containers from a pre-warmed pool. Each worker compiles (if needed), streams test cases from S3 one by one, runs each case under a cgroup-enforced memory limit and a wall-clock time limit, and compares stdout against expected output. Fail-fast on first wrong answer in contest mode. Verdict written to result service + leaderboard updated via sorted set in Redis. Plagiarism detected offline via MOSS k-gram fingerprinting. At 10K concurrent contest submissions, autoscale the worker pool to ~5K gVisor instances."

⚠

Anti-patterns

🚫

Run user code directly on the host (no sandbox)

Fork bombs, memory exhaustion, file-system reads, and network exfiltration are all trivial for a malicious submitter.

✓ Better: gVisor or Firecracker sandbox with restricted syscalls, no network, cgroup mem/cpu limits.

🚫

Load all 200 test cases into memory before running

Large test suites (e.g., graph problems with 10M-node inputs) exhaust judge memory. Startup latency grows.

✓ Better: Stream test cases from S3 one at a time. Each case: read input → run → compare → discard → next.
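
The read, run, compare, discard cycle maps naturally onto a generator. In this sketch a local directory stands in for S3, and the `N.in` / `N.out` file naming is a hypothetical convention; only the file names are listed up front, while contents are read one case at a time.

```python
import tempfile
from pathlib import Path


def stream_cases(case_dir):
    """Yield (input_text, expected_text) one case at a time, in numeric order."""
    in_paths = sorted(Path(case_dir).glob("*.in"), key=lambda p: int(p.stem))
    for in_path in in_paths:
        # Only this case's data is in memory; it is discarded on the next step.
        yield in_path.read_text(), in_path.with_suffix(".out").read_text()


# Usage: write two tiny cases to a temp directory and consume them lazily.
with tempfile.TemporaryDirectory() as d:
    for i, (inp, out) in enumerate([("1 2\n", "3\n"), ("5 7\n", "12\n")], start=1):
        Path(d, f"{i}.in").write_text(inp)
        Path(d, f"{i}.out").write_text(out)
    cases = stream_cases(d)
    first = next(cases)       # reads only case 1
    remaining = list(cases)   # reads the rest on demand
```

For inputs too large to hold even one at a time, the same idea extends to passing an open file handle as the child's stdin instead of reading the text.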

🚫

Single queue for contest + practice (no priority)

During a 10K-person contest, practice submissions block contest verdicts. Contest participants see 5-minute wait times.

✓ Better: Priority queue: contest submissions drain first; practice throttled (not dropped) during contest peaks.
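
One way to throttle practice submissions without dropping them is two lanes with a bounded contest streak. The sketch below is an in-memory illustration, not a distributed queue; the `JudgeQueue` class and the `contest_burst` parameter are hypothetical, and the policy (one practice job after every N consecutive contest jobs) is an assumption about what throttling means here.

```python
from collections import deque


class JudgeQueue:
    """Two lanes: contest jobs drain first, practice is throttled, never starved."""

    def __init__(self, contest_burst=4):
        self.contest = deque()
        self.practice = deque()
        self.contest_burst = contest_burst  # contest jobs served per practice job
        self._streak = 0

    def enqueue(self, job, is_contest):
        (self.contest if is_contest else self.practice).append(job)

    def dequeue(self):
        contest_turn_over = self._streak >= self.contest_burst
        if self.practice and (not self.contest or contest_turn_over):
            self._streak = 0
            return self.practice.popleft()
        if self.contest:
            self._streak += 1
            return self.contest.popleft()
        return None  # both lanes empty


q = JudgeQueue(contest_burst=2)
for job in ("c1", "c2", "c3"):
    q.enqueue(job, is_contest=True)
for job in ("p1", "p2"):
    q.enqueue(job, is_contest=False)
order = [q.dequeue() for _ in range(6)]
```

With `contest_burst=2`, the drain order is two contest jobs, one practice job, then the remaining contest job, so practice users wait longer during a contest but always make progress.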

Tradeoffs & Design Choices

VM (Firecracker) vs gVisor vs bare container. Firecracker: strongest isolation, ~125 ms boot, expensive at scale. gVisor: strong (user-space kernel), moderate runtime overhead (~10%), good balance. Bare container: lightest, shared kernel risk. For a judge: gVisor is the sweet spot.
Fail-fast vs run-all. Contest: stop on first wrong answer (saves compute, faster verdicts). Practice: run all test cases (better feedback — "you passed 47/50"). Configurable per submission type.
Pre-warmed pool vs cold-start. gVisor containers take ~1–2 s to cold-start (pull image, init runtime). Pre-warmed pool: keep 500+ idle containers ready. Costs memory but gives sub-second start. During contest: burst beyond pool → cold-start for overflow.
Test case storage. S3 for the source of truth; local SSD cache on judge nodes for hot problems (top-100 problems account for 80% of submissions). Cache hit → no S3 round-trip.
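Leaderboard freshness. Redis sorted set updated on each verdict → real-time. For a 10K-person contest: sorted set with score = (problems_solved × 10000) - total_time_seconds. O(log N) per update. This encoding only ranks correctly while total_time_seconds stays below the 10000 multiplier (about 2 h 46 min, penalties included); use a larger multiplier for longer contests.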
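
The single-number leaderboard score can be checked in plain Python before it goes anywhere near Redis. This sketch assumes a larger multiplier, `MAX_TOTAL_SECONDS`, chosen to exceed any achievable total time; the function names are illustrative, and the resulting value is what would be stored as the sorted-set score.

```python
# Assumption: no participant can accumulate this many seconds, penalties included.
MAX_TOTAL_SECONDS = 1_000_000


def composite_score(problems_solved, total_time_seconds):
    """Higher is better: more problems first, then less total time."""
    if not 0 <= total_time_seconds < MAX_TOTAL_SECONDS:
        raise ValueError("total time must fit below the multiplier")
    return problems_solved * MAX_TOTAL_SECONDS - total_time_seconds


def rank(entries):
    """entries maps user -> (problems_solved, total_time_seconds); best first."""
    return sorted(entries, key=lambda u: composite_score(*entries[u]), reverse=True)


board = rank({"ann": (3, 12_000), "bob": (2, 1_000), "cy": (3, 9_000)})

# Why the multiplier matters: with 10000, three solves in 12,000 s would score
# below two solves in 1,000 s, which inverts the intended order.
too_small = 3 * 10_000 - 12_000 < 2 * 10_000 - 1_000
```

Redis stores sorted-set scores as double-precision floats, which represent integers exactly up to 2^53, so scores of this size stay exact.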

Failure Modes

🧨

Sandbox escape — user code accesses host

Zero-day in gVisor or kernel namespace allows user code to read other submissions or host data.

→ Mitigation: defense in depth — gVisor + seccomp + read-only rootfs + no network + ephemeral containers (destroyed after each run). Audit + upgrade gVisor regularly.

🐢

Judge queue backs up during mega-contest

50K submissions in 5 minutes; worker pool can't keep up. Verdict latency climbs to 5+ minutes.

→ Mitigation: autoscale workers on queue depth. Pre-warm extra capacity 10 min before contest starts. Admission-rate-limit submissions per user (max 3/min).
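
The per-user admission limit can be sketched as a sliding-window counter. This is an in-memory illustration with hypothetical names; a real deployment would need shared state across API servers, which is not shown here.

```python
from collections import defaultdict, deque


class SubmissionRateLimiter:
    """Allow at most max_per_window submissions per user in any window_s seconds."""

    def __init__(self, max_per_window=3, window_s=60.0):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self._recent = defaultdict(deque)  # user_id -> timestamps of accepted submissions

    def allow(self, user_id, now):
        recent = self._recent[user_id]
        # Forget submissions that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        return True


limiter = SubmissionRateLimiter(max_per_window=3, window_s=60.0)
decisions = [limiter.allow("u1", t) for t in (0.0, 10.0, 20.0, 30.0, 60.0)]
```

Passing `now` explicitly keeps the limiter deterministic and easy to test; callers would normally supply a monotonic clock reading.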

🎲

Non-deterministic verdict

User code uses random(), uninitialized memory, or timing-dependent logic. Same code → different verdicts on retry.

→ Mitigation: seed random deterministically per test case. Pin container CPU affinity. Flag non-deterministic outcomes for re-judge. Accept small floating-point epsilon.

💀

Fork bomb / resource exhaustion inside sandbox

User code calls fork() in a loop; 100K processes inside the container.

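→ Mitigation: cgroup PID limit (e.g., max 32 processes). For problems and runtimes that need only a single thread, a syscall filter blocks process creation entirely; on Linux that means filtering clone as well as fork and vfork, since fork() is implemented on top of clone.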

📋

Plagiarism during live contest

Participants share solutions in real time via Discord / Telegram. Leaderboard polluted.

→ Mitigation: post-contest MOSS run flags identical submissions. Participants with Jaccard > 0.85 reviewed + penalized. Randomized test-case ordering makes screenshot-sharing less useful.

🔥

Test case data corruption

Wrong expected output for a test case → correct solutions get "Wrong Answer." Infuriating for users.

→ Mitigation: test cases version-controlled + validated by running reference solution before publish. Any test case change triggers re-judge of recent submissions.
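
Validating test cases against a reference solution before publishing can be as simple as the check below. It is a sketch with hypothetical names: `reference` is any callable that maps input text to output text, and the sum problem is a made-up example.

```python
def find_bad_cases(reference, cases):
    """Return ids of cases whose stored expected output disagrees with the reference."""
    bad = []
    for case_id, (stdin_text, expected) in cases.items():
        if reference(stdin_text).split() != expected.split():
            bad.append(case_id)
    return bad


def reference_sum(stdin_text):
    """Made-up reference solution: print the sum of the input integers."""
    return str(sum(int(tok) for tok in stdin_text.split()))


cases = {
    "c1": ("1 2\n", "3\n"),
    "c2": ("5 7\n", "13\n"),  # corrupted expected output; the true answer is 12
}
bad = find_bad_cases(reference_sum, cases)
```

Running this as a publish gate means a corrupted expected output is caught before any user submission is judged against it.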

Interview Tips

Lead with the sandbox. "We're running arbitrary untrusted code — isolation is the #1 concern." Then name gVisor or Firecracker.
Priority queue for contest vs practice. A single queue is wrong. Name the priority lanes explicitly.
Stream test cases, don't batch-load. Large inputs (10M-node graph) can't be pre-loaded into memory. Stream from S3 per case.
Autoscale workers, not the API. The API is lightweight; the workers are the bottleneck. Scale workers on queue depth.
Plagiarism is offline, not real-time. Too expensive to run during contest. Batch after. MOSS / AST fingerprint.

Evolution

MVP — single server, Docker, sequential

One machine. Docker container per submission. Sequential execution. Works for ~100 submissions/day. UVa Online Judge era.

Queue + worker pool + priority lanes

SQS/Redis queue. Worker fleet autoscales. Contest priority lane. Handles ~10K/day.

gVisor sandbox + pre-warmed container pool

Stronger isolation. Sub-second container start. Handles concurrent contests with 10K+ participants.

Streaming test cases + fail-fast

Test cases streamed from S3; not batch-loaded. Fail-fast on wrong answer in contest. Reduces avg judge time 40%.

AI-powered hints + plagiarism + editorial generation

LLM generates hints on wrong answer. MOSS + AST for plagiarism. Auto-generate editorial from reference solution. Modern LeetCode (2024+).
