Users submit code solutions in 15+ languages. The system compiles, runs against hidden test cases with strict time/memory limits, and returns a verdict — all within seconds. The hard parts:
sandboxed execution that prevents user code from escaping or consuming unbounded resources,
a judge queue that handles 10K concurrent contest submissions without starving practice users,
and plagiarism detection across millions of historical submissions. The design targets LeetCode-scale traffic, assumed here to be ~10M submissions/day.
Functional
Submit code in Python, Java, C++, Go, JS, Rust, etc.
Compile (if compiled lang) → run against hidden test cases → return verdict (Accepted, Wrong Answer, TLE, MLE, RE)
Enforce per-problem time limit (e.g., 2 sec) and memory limit (e.g., 256 MB)
Live contests with thousands of concurrent submitters; contest leaderboard ranked by solve-time + penalty
Practice mode: submit anytime, see pass/fail per test case
Run history, editorial solutions, discussion forum
Plagiarism detection across submissions for the same problem
Non-Functional
Verdict returned in < 10 seconds for practice; < 30 s for large test suites
Sandbox isolation — user code cannot read other submissions, access network, or escape container
Scale to 10K concurrent judge jobs during peak contest
Consistent verdicts — same code always produces same result (deterministic execution)
Fair during contests — all submissions judged with equal resources (no advantage from lucky node placement)
99.9% availability — downtime during a contest is catastrophic
03
Scale Estimation
Submissions / day
~10M
~115/sec avg; 10K+/sec burst during contest start
Avg execution time
~2 sec
compile + run all test cases; varies by language
Test cases / problem
~50–200
hidden; streamed to judge; not loaded all at once
Judge workers needed (peak)
~5K
10K submissions × 2 sec each / 4 workers/machine
Languages supported
15+
each needs its own compiler/runtime in the sandbox image
Submission storage
~10 TB / year
10M/day × ~3 KB code avg; test cases separate
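The figures above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch only; the function names are made up for illustration, and storage uses decimal units (1 TB = 10^9 KB). Note that the last figure, as computed in the table (jobs × seconds ÷ workers per machine), reads most naturally as a machine count.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def avg_submissions_per_sec(per_day: int) -> float:
    """Average request rate implied by a daily submission count."""
    return per_day / SECONDS_PER_DAY


def yearly_storage_tb(per_day: int, avg_code_kb: float) -> float:
    """Source-code storage per year in TB (decimal units, test cases excluded)."""
    return per_day * avg_code_kb * 365 / 1e9


def peak_capacity(burst_jobs: int, seconds_per_job: float, workers_per_machine: int) -> int:
    """The table's sizing formula: jobs x seconds each / workers per machine."""
    return math.ceil(burst_jobs * seconds_per_job / workers_per_machine)


print(f"{avg_submissions_per_sec(10_000_000):.1f} submissions/sec on average")
print(f"{yearly_storage_tb(10_000_000, 3):.2f} TB of code per year")
print(f"{peak_capacity(10_000, 2, 4)} at peak")
```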
04
API Design
POST /api/submissions
Submit code. Body: {problem_id, language, code, contest_id?}. Returns {submission_id, status: "queued"}. Client polls or subscribes via WebSocket for verdict.
GET /api/submissions/{id}
Get submission status + verdict. Returns {status, verdict, runtime_ms, memory_kb, test_cases_passed, total_test_cases, error_output?}.
POST /api/run
"Run code" (practice mode). Like submit but only runs against sample test cases (not hidden). Returns output. No leaderboard impact.
GET /api/contests/{id}/leaderboard?page=1
Contest leaderboard. Ranked by (problems_solved DESC, total_time ASC). Paginated. Updated near-real-time as verdicts land.
GET /api/problems/{id}/submissions?user_id=X
User's submission history for a problem. Returns list of {submission_id, verdict, language, runtime_ms, submitted_at}.
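For clients that poll rather than subscribe over WebSocket, the loop might look like the sketch below. It assumes a caller-supplied `fetch_status` function that wraps GET /api/submissions/{id}. The status value "finished" is an assumption (the API above only names "queued"), as are the backoff parameters.

```python
import time
from typing import Callable, Dict

FINISHED = "finished"  # assumed terminal status; the API above only names "queued"


def poll_verdict(
    fetch_status: Callable[[str], Dict],
    submission_id: str,
    *,
    initial_delay: float = 0.5,
    max_delay: float = 4.0,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the submission reaches a terminal status, with exponential backoff."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        result = fetch_status(submission_id)
        if result.get("status") == FINISHED:
            return result
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"no verdict for {submission_id} within {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off, capped at max_delay
```

Backoff keeps the polling load on the API bounded while a contest burst is draining; the `sleep` parameter is injectable so the loop can be tested without waiting.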
05
Architecture
Three tiers: submission API (accept + queue), judge farm (sandboxed execution), and result store (verdicts + leaderboard). The judge farm is the interesting part — each worker is a short-lived sandbox that compiles, runs, and compares output, then dies.
The fundamental tension: you're running arbitrary untrusted code on your servers. User code can fork-bomb, allocate terabytes, read /etc/passwd, open sockets, or try to escape to the host. The sandbox must make all of these impossible while still allowing normal computation.
Sandbox options (from strongest to lightest):
VM per submission (Firecracker). Full kernel isolation. ~125 ms boot. Strongest. AWS Lambda uses this. Expensive at 10K concurrent.
gVisor (Google's user-space kernel). Intercepts syscalls and reimplements them in user space. Containers run normally but kernel calls go through gVisor. Moderate overhead (~10% slower), excellent isolation. A reasonable default for judge-style services.
Container with seccomp + namespaces. Standard Linux containers with restricted syscall set. Lighter but shares host kernel — kernel exploits can escape. Fine for trusted code; risky for arbitrary.
Judge pipeline, step by step:
Submission lands in queue. Two priority lanes: contest (higher) and practice (lower). During a contest, practice submissions are throttled but not dropped.
Worker picks job. Spins up a gVisor sandbox with the language runtime (pre-warmed container pool for Python, Java, C++). Pulls user code from S3.
Compile (if needed). C++/Java/Rust compiled inside sandbox. Time limit on compilation (e.g., 30 s). Compile error → verdict = CE.
Run against test cases, one by one. Test cases streamed from S3 (not loaded all at once — some problems have 200+ cases, each with large input). For each: set stdin, run, capture stdout, compare against expected. On first Wrong Answer in contest mode → stop (fail-fast). In practice mode → run all for feedback.
Resource enforcement. cgroups for memory limit (kill if exceeds). Wall-clock timer for time limit (SIGKILL after TL × 1.5 for safety). No network access (sandbox drops all outbound).
Verdict emitted. Worker writes {submission_id, verdict, runtime_ms, memory_kb, test_cases_passed} to verdict service. Worker sandbox destroyed. Leaderboard updated if contest.
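The per-test-case loop in steps 4 to 6 can be sketched as follows. This is a simplified illustration, not a sandbox: it runs the program as a plain subprocess, enforces only the wall-clock kill at TL × 1.5, and leaves memory limits and isolation to cgroups and gVisor as described above. The names `run_case`, `judge`, and `JudgeResult` are hypothetical, and stopping on any failure (not only Wrong Answer) in fail-fast mode is an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class JudgeResult:
    verdict: str   # "AC", "WA", "TLE", or "RE"
    passed: int    # test cases that produced the expected output
    executed: int  # test cases actually run (fewer than total when fail-fast stops early)


def run_case(cmd: List[str], stdin_text: str, time_limit_s: float) -> Tuple[str, str]:
    """Run one test case and return (status, stdout)."""
    try:
        proc = subprocess.run(
            cmd, input=stdin_text, capture_output=True, text=True,
            timeout=time_limit_s * 1.5,  # hard wall-clock kill, as in step 5
        )
    except subprocess.TimeoutExpired:
        return "TLE", ""
    if proc.returncode != 0:
        return "RE", proc.stdout
    return "OK", proc.stdout


def judge(cmd: List[str], cases: Iterable[Tuple[str, str]],
          time_limit_s: float, fail_fast: bool) -> JudgeResult:
    """Consume (stdin, expected) pairs lazily, so cases can be streamed one at a time."""
    passed = executed = 0
    first_failure = None
    for stdin_text, expected in cases:
        executed += 1
        status, out = run_case(cmd, stdin_text, time_limit_s)
        if status == "OK" and out.strip() == expected.strip():
            passed += 1
            continue
        if first_failure is None:
            first_failure = "WA" if status == "OK" else status
        if fail_fast:  # contest mode: stop at the first failing case
            break
    return JudgeResult(first_failure or "AC", passed, executed)


if __name__ == "__main__":
    doubler = [sys.executable, "-c", "print(int(input()) * 2)"]
    print(judge(doubler, [("1", "2"), ("2", "5")], time_limit_s=2.0, fail_fast=True))
```

Because `cases` is any iterable, a generator that fetches one case at a time from storage plugs in directly, and in fail-fast mode later cases are never fetched.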
Judge Execution Sequence (Mermaid)
sequenceDiagram
participant U as User
participant S as Submission svc
participant Q as Judge Queue
participant W as Worker (gVisor)
participant TC as S3 (test cases)
participant V as Verdict svc
participant L as Leaderboard
U->>S: POST /submissions {code}
S->>S: persist to Postgres + S3
S->>Q: enqueue (priority=contest)
S-->>U: submission_id, status=queued
Q->>W: dequeue job
W->>W: spin gVisor sandbox
W->>W: compile (C++ → binary)
loop each test case
W->>TC: stream test input
W->>W: run with time/mem limits
W->>W: compare output
Note over W: Wrong Answer → stop (contest)
end
W->>V: verdict {AC, 45ms, 12MB}
V->>L: update leaderboard
V-->>U: push verdict via WS
Deterministic execution. Same code must produce the same verdict every time. Threats: random() in user code (seed it), floating-point non-determinism (accept epsilon), time-dependent code (mock system clock). Workers run on identical hardware specs; container images are pinned versions.
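The "accept epsilon" rule for floating-point output could be implemented as a token-wise checker along these lines. This is a sketch under assumptions: the tolerance of 1e-6 and the rule that integer-looking expected tokens must match exactly are illustrative choices, not something the design above specifies.

```python
import math


def _as_int(token: str):
    """Return the token as an int, or None if it is not an integer literal."""
    try:
        return int(token)
    except ValueError:
        return None


def outputs_match(actual: str, expected: str, eps: float = 1e-6) -> bool:
    """Compare whitespace-separated tokens; floats may differ by up to eps."""
    a_tokens, e_tokens = actual.split(), expected.split()
    if len(a_tokens) != len(e_tokens):
        return False
    for a, e in zip(a_tokens, e_tokens):
        if a == e:
            continue
        expected_int = _as_int(e)
        if expected_int is not None:
            # Integer answers get no tolerance, so off-by-one results still fail.
            if _as_int(a) != expected_int:
                return False
            continue
        try:
            a_val, e_val = float(a), float(e)
        except ValueError:
            return False  # non-numeric tokens must match exactly
        if not math.isclose(a_val, e_val, rel_tol=eps, abs_tol=eps):
            return False
    return True
```

Splitting on whitespace also makes the comparison insensitive to trailing newlines and spacing differences, which is usually what a judge wants.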
Plagiarism detection. Offline batch job after each contest. Uses MOSS-style (Measure of Software Similarity) fingerprinting: normalize code (strip whitespace, rename variables), extract k-gram hashes, compare Jaccard similarity. Flag pairs above threshold for manual review. AST-based comparison catches deeper structural plagiarism than text-based.
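A minimal sketch of the normalize, k-gram, Jaccard pipeline described above. It is deliberately simplified: real MOSS additionally selects a subset of fingerprints by winnowing, which is omitted here, and the tiny keyword set, the regex tokenizer, and k = 5 are assumptions for illustration.

```python
import re
import zlib
from typing import List, Set

# Tiny illustrative keyword set; a real normalizer would use the language's full grammar.
KEYWORDS = {"def", "return", "for", "in", "if", "else", "while", "range"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(code: str) -> List[str]:
    """Tokenize, drop whitespace, and replace every non-keyword identifier with 'ID'."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        is_identifier = tok[0].isalpha() or tok[0] == "_"
        tokens.append("ID" if is_identifier and tok not in KEYWORDS else tok)
    return tokens


def fingerprints(code: str, k: int = 5) -> Set[int]:
    """Hash every window of k consecutive normalized tokens."""
    toks = normalize(code)
    return {
        zlib.crc32(" ".join(toks[i:i + k]).encode())
        for i in range(len(toks) - k + 1)
    }


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "def f(n):\n    total = 0\n    for i in range(n):\n        total += i\n    return total"
renamed = "def g(m):\n  acc=0\n  for j in range(m):\n    acc+=j\n  return acc"
print(jaccard(fingerprints(original), fingerprints(renamed)))
```

Renaming variables and reformatting leaves the fingerprint set unchanged, so the two snippets above compare as identical, while structurally different code shares few k-grams.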
Interview answer
"Submissions enqueue with contest > practice priority. Workers are gVisor-sandboxed containers from a pre-warmed pool. Each worker compiles (if needed), streams test cases from S3 one by one, runs each case under a cgroup-enforced memory limit and a wall-clock time limit, and compares stdout against expected output. Fail-fast on first wrong answer in contest mode. Verdict written to result service + leaderboard updated via sorted set in Redis. Plagiarism detected offline via MOSS k-gram fingerprinting. At 10K concurrent contest submissions, autoscale the worker pool to ~5K gVisor instances."
⚠
Anti-patterns
🚫
Run user code directly on the host (no sandbox)
Fork bombs, memory exhaustion, file-system reads, and network exfiltration are all trivial for a malicious submitter.
✓ Better: gVisor or Firecracker sandbox with restricted syscalls, no network, cgroup mem/cpu limits.
🚫
Load all 200 test cases into memory before running
Large test suites (e.g., graph problems with 10M-node inputs) exhaust judge memory. Startup latency grows.
✓ Better: Stream test cases from S3 one at a time. Each case: read input → run → compare → discard → next.
🚫
Single queue for contest + practice (no priority)
During a 10K-person contest, practice submissions block contest verdicts. Contest participants see 5-minute wait times.
✓ Better: Priority queue: contest submissions drain first; practice throttled (not dropped) during contest peaks.
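One way to get "contest first, practice throttled but not dropped" is a two-lane queue that reserves a small share of dequeues for practice. The sketch below is an in-memory illustration only; the class name `JudgeQueue` and the 1-in-N reservation policy (`practice_every`) are assumptions, not the mechanism of any particular queueing system.

```python
from collections import deque
from typing import Deque, Optional


class JudgeQueue:
    """Two lanes: contest drains first, but 1 of every N dequeues goes to practice."""

    def __init__(self, practice_every: int = 5) -> None:
        self._contest: Deque[str] = deque()
        self._practice: Deque[str] = deque()
        self._practice_every = practice_every
        self._since_practice = 0  # contest jobs served since the last practice job

    def push(self, submission_id: str, contest: bool) -> None:
        (self._contest if contest else self._practice).append(submission_id)

    def pop(self) -> Optional[str]:
        practice_due = self._since_practice >= self._practice_every - 1
        # Serve practice when its reserved slot comes up, or when contest is idle.
        if self._practice and (practice_due or not self._contest):
            self._since_practice = 0
            return self._practice.popleft()
        if self._contest:
            self._since_practice += 1
            return self._contest.popleft()
        return None  # both lanes empty


q = JudgeQueue(practice_every=3)
for sid in ("c1", "c2", "c3", "c4"):
    q.push(sid, contest=True)
for sid in ("p1", "p2"):
    q.push(sid, contest=False)
print([q.pop() for _ in range(6)])
```

Strict priority alone would starve practice users for the whole contest; the reserved slot bounds their wait while still giving contest traffic most of the capacity.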
07
Tradeoffs & Design Choices
VM (Firecracker) vs gVisor vs bare container. Firecracker: strongest isolation, ~125 ms boot, expensive at scale. gVisor: strong (user-space kernel), ~30 ms overhead, good balance. Bare container: lightest, shared kernel risk. For a judge: gVisor is the sweet spot.
Fail-fast vs run-all. Contest: stop on first wrong answer (saves compute, faster verdicts). Practice: run all test cases (better feedback — "you passed 47/50"). Configurable per submission type.
Pre-warmed pool vs cold-start. gVisor containers take ~1–2 s to cold-start (pull image, init runtime). Pre-warmed pool: keep 500+ idle containers ready. Costs memory but gives sub-second start. During contest: burst beyond pool → cold-start for overflow.
Test case storage. S3 for the source of truth; local SSD cache on judge nodes for hot problems (e.g., if the top 100 problems account for ~80% of submissions, most runs hit the cache). Cache hit → no S3 round-trip.
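Leaderboard freshness. Redis sorted set updated on each verdict → real-time. For 10K-person contest: sorted set with score = (problems_solved × M) - total_time_seconds, read in descending order. The multiplier M must exceed the largest possible total_time_seconds (penalties included); with M = 10000, a total time above ~2.8 hours would outweigh a solved problem, so pick something like 1,000,000. O(log N) per update.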
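The composite-score idea can be checked in plain Python before wiring it to Redis. This sketch assumes a multiplier of 1,000,000 (rather than 10,000) so that a long total time, penalties included, can never outweigh a solved problem; the constant and function names are illustrative. In Redis the same integer would be the member's score, written with ZADD and read in descending order.

```python
from typing import Dict, List, Tuple

# Assumed upper bound on total_time_seconds (penalties included) for one contest.
TIME_CAP_SECONDS = 1_000_000


def leaderboard_score(problems_solved: int, total_time_seconds: int) -> int:
    """Higher is better: more problems first, then less total time."""
    if not 0 <= total_time_seconds < TIME_CAP_SECONDS:
        raise ValueError("total_time_seconds must be below the time cap")
    return problems_solved * TIME_CAP_SECONDS - total_time_seconds


def rank(entries: Dict[str, Tuple[int, int]]) -> List[str]:
    """Order users by descending score; entries maps user -> (solved, total_time)."""
    return sorted(entries, key=lambda user: leaderboard_score(*entries[user]), reverse=True)


standings = {
    "alice": (3, 4_000),
    "bob": (3, 3_500),
    "carol": (2, 100),
    "dave": (4, 20_000),  # most problems solved, but a large accumulated time
}
print(rank(standings))
```

With a multiplier of 10,000, dave would score 4 × 10000 − 20000 = 20000 and fall below alice (26000) despite solving more problems, which is why the multiplier has to exceed the maximum possible total time.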
08
Failure Modes
🧨
Sandbox escape — user code accesses host
Zero-day in gVisor or kernel namespace allows user code to read other submissions or host data.
→ Mitigation: defense in depth — gVisor + seccomp + read-only rootfs + no network + ephemeral containers (destroyed after each run). Audit + upgrade gVisor regularly.
🐢
Judge queue backs up during mega-contest
50K submissions in 5 minutes; worker pool can't keep up. Verdict latency climbs to 5+ minutes.
→ Mitigation: autoscale workers on queue depth. Pre-warm extra capacity 10 min before contest starts. Admission-rate-limit submissions per user (max 3/min).
🎲
Non-deterministic verdict
User code uses random(), uninitialized memory, or timing-dependent logic. Same code → different verdicts on retry.
→ Mitigation: seed random deterministically per test case. Pin container CPU affinity. Flag non-deterministic outcomes for re-judge. Accept small floating-point epsilon.
💀
Fork bomb / resource exhaustion inside sandbox
User code calls fork() in a loop; 100K processes inside the container.
→ Mitigation: cgroup PID limit (e.g., pids.max = 32). For languages whose runtime needs no extra threads or processes, a syscall filter can block process creation (fork/clone) entirely.
📋
Plagiarism during live contest
Participants share solutions in real time via Discord / Telegram. Leaderboard polluted.
→ Mitigation: post-contest MOSS run flags near-identical submissions. Submission pairs with Jaccard similarity > 0.85 are reviewed and, if confirmed, penalized. Randomized test-case ordering makes screenshot-sharing less useful.
🔥
Test case data corruption
Wrong expected output for a test case → correct solutions get "Wrong Answer." Infuriating for users.
→ Mitigation: test cases version-controlled + validated by running reference solution before publish. Any test case change triggers re-judge of recent submissions.
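The "validate by running the reference solution before publish" step might look like the following sketch. The reference solution is modeled as a plain Python callable from input text to output text, and the function name `find_bad_cases` and the (case_id, input, expected) tuple layout are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def find_bad_cases(
    reference: Callable[[str], str],
    cases: Iterable[Tuple[str, str, str]],
) -> List[str]:
    """Return ids of cases whose stored expected output disagrees with the reference."""
    bad = []
    for case_id, stdin_text, expected in cases:
        if reference(stdin_text).strip() != expected.strip():
            bad.append(case_id)
    return bad


def reference_sum(stdin_text: str) -> str:
    """Toy reference solution: sum the integers in the input."""
    return str(sum(int(tok) for tok in stdin_text.split()))


cases = [
    ("case-1", "1 2 3", "6"),
    ("case-2", "10 20", "31"),  # corrupted expected output
    ("case-3", "", "0\n"),
]
print(find_bad_cases(reference_sum, cases))
```

A publish pipeline would refuse to ship the problem while this list is non-empty, and the same check can gate any later edit to the test data.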
09
Interview Tips
Lead with the sandbox. "We're running arbitrary untrusted code — isolation is the #1 concern." Then name gVisor or Firecracker.
Priority queue for contest vs practice. A single queue is wrong. Name the priority lanes explicitly.
Stream test cases, don't batch-load. Large inputs (10M-node graph) can't be pre-loaded into memory. Stream from S3 per case.
Autoscale workers, not the API. The API is lightweight; the workers are the bottleneck. Scale workers on queue depth.
Plagiarism is offline, not real-time. Too expensive to run during contest. Batch after. MOSS / AST fingerprint.
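The "scale workers on queue depth" tip above can be made concrete with a simple sizing rule: run enough workers to drain the current backlog within a target time. The formula, the 30-second drain target, and the function name are assumptions for illustration; the bounds reuse the figures from earlier sections (a 500-container warm pool and a ~5K peak).

```python
import math


def desired_workers(
    queue_depth: int,
    avg_job_seconds: float,
    target_drain_seconds: float,
    min_workers: int,
    max_workers: int,
) -> int:
    """Workers needed to drain the backlog in the target time, clamped to pool bounds."""
    needed = math.ceil(queue_depth * avg_job_seconds / target_drain_seconds)
    return max(min_workers, min(max_workers, needed))


# 50K queued jobs at ~2 s each, drained within 30 s, between 500 and 5000 workers.
print(desired_workers(50_000, 2.0, 30.0, min_workers=500, max_workers=5000))
```

An autoscaler would re-evaluate this on each queue-depth sample; the lower bound keeps the warm pool alive when the queue is empty, and the upper bound caps spend during a runaway backlog.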