System Design — 05

YouTube / Netflix

A video streaming platform that ingests raw user-uploaded content (YouTube) or studio-licensed media (Netflix), transcodes it into adaptive renditions, and delivers billions of views globally with sub-2-second start times. The hard problems live in two separate worlds: the write path (encoding pipeline) and the read path (CDN delivery).

⚡ Core: Video Encoding Pipeline · 500M DAU · Read-heavy 200:1 · 500h uploaded/min · Adaptive Bitrate · CDN-first delivery
02

Requirements

Functional
  • Users can upload videos — raw file, any format, up to several GB
  • Uploaded videos are automatically transcoded into multiple resolutions
  • Users can stream videos with adaptive quality based on connection speed
  • Users can search for videos by title, description, and tags
  • Users see a personalised recommendations feed on the homepage
Non-Functional
  • 99.99% availability — video must always be watchable
  • First frame in < 2 seconds — drop-off spikes after 2s buffer
  • Upload async OK — users tolerate a few minutes for processing
  • 99.999999% durability — losing someone's video is catastrophic
  • Eventual consistency OK — new videos can take minutes to propagate globally
YouTube vs Netflix — Key Difference

YouTube's hard problem is the write side — processing 500 hours of video per minute from arbitrary sources. Netflix's hard problem is the read side — delivering studio-quality content flawlessly to 200M subscribers. Same product surface, fundamentally different architecture pressure.

YouTube
  • User-generated content — anyone uploads
  • 500 hours of video per minute
  • Reactive transcoding pipeline
  • Fixed bitrate ladder per rendition
  • Hard problem: write path scale
Netflix
  • Studio-licensed catalog — controlled intake
  • Hundreds of titles per year
  • Proactive transcoding — weeks before release
  • Per-title encoding — custom bitrate per show
  • Hard problem: read path quality & reliability
03

Scale Estimation

Metric | Calculation | Result
Daily Active Users | YouTube benchmark | 500M DAU
Video views/day | ~2 views per DAU | ~1B views/day
View RPS | 1B ÷ 86,400 | ~11,600 RPS
Upload volume | Industry stat | 500h video/min
Raw ingestion bandwidth | 500h/min × 60 min/h × ~1GB per 10 min of video ÷ 60 s | ~50 GB/s
Storage per year (raw) | 50 GB/s × 86,400 × 365 | ~1.5 EB/year
Storage multiplier (transcoding) | 6 renditions × ~0.7 avg ratio | ~4× raw size (~5× with the raw original kept)
Read : Write ratio | 1B views ÷ ~5M uploads | ~200:1
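The estimates in the table can be re-derived in a few lines. This is only a back-of-envelope sketch: every input below is one of the table's own assumptions (500M DAU, ~2 views per user, ~1 GB per 10 minutes of raw video, ~5M uploads per day), not a measured figure.

```python
# Back-of-envelope scale estimates; all inputs are assumptions from the table.
SECONDS_PER_DAY = 86_400

dau = 500_000_000
views_per_dau = 2
views_per_day = dau * views_per_dau
view_rps = views_per_day / SECONDS_PER_DAY

upload_hours_per_min = 500
video_minutes_per_min = upload_hours_per_min * 60   # minutes of video arriving per wall-clock minute
ingest_gb_per_min = video_minutes_per_min / 10      # ~1 GB per 10 minutes of raw video
ingest_gb_per_s = ingest_gb_per_min / 60
raw_eb_per_year = ingest_gb_per_s * SECONDS_PER_DAY * 365 / 1e9  # 1 EB = 1e9 GB

uploads_per_day = 5_000_000
read_write_ratio = views_per_day / uploads_per_day

print(f"view RPS        ~{view_rps:,.0f}")
print(f"raw ingest      ~{ingest_gb_per_s:.0f} GB/s")
print(f"raw storage     ~{raw_eb_per_year:.2f} EB/year")
print(f"read:write      ~{read_write_ratio:.0f}:1")
```

Changing any single input (for example, the assumed raw bitrate) shifts the storage figure linearly, which is why these numbers are best treated as orders of magnitude.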
The Number That Drives Everything

The 200:1 read-to-write ratio is the most important number. This is almost entirely a read and delivery problem once a video is processed. Every architectural decision — CDN-first, aggressive caching, pre-warming, adaptive bitrate — flows from this single fact. Writes can be slow and async. Reads must be instant.

04

API Design

POST /v1/videos/initiate    Start a resumable upload session
  // Request
  { "filename": "vacation.mov", "size_bytes": 2147483648, "content_type": "video/quicktime" }
  // Response 201
  { "upload_id": "upl_abc123", "chunk_size_bytes": 5242880, "total_chunks": 410 }
PUT /v1/videos/:upload_id/chunks/:n    Upload a single 5MB chunk
  // Body: raw binary chunk data (5MB)
  // Response 200 (idempotent, safe to retry)
  { "chunk": 47, "received": true, "chunks_remaining": 363 }
POST /v1/videos/:upload_id/complete    Signal upload complete, begin transcoding
  // Response 202 Accepted (async processing begins)
  { "video_id": "vid_xyz789", "status": "processing", "eta_seconds": 300 }
GET /v1/videos/:video_id    Fetch video metadata and manifest URL
  // Response 200
  {
    "id": "vid_xyz789",
    "title": "My Vacation",
    "duration_s": 342,
    "manifest_url": "https://cdn.example.com/vid_xyz789/manifest.m3u8",
    "thumbnail_url": "https://cdn.example.com/vid_xyz789/thumb.jpg",
    "status": "ready"
  }
GET /v1/search?q=:query&page_token=:token    Two-stage retrieval + ranking
  // Response 200
  { "results": [ { "id": "vid_abc", "title": "...", "score": 0.92 } ], "next_page_token": "pg_2" }
GET /v1/feed?user_id=:id&page_token=:token    Personalised recommendations homepage
  // Response 200
  { "videos": [ ... ], "next_page_token": "pg_2" }
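For the resumable upload endpoints, a client needs to turn a file size into a list of chunk byte ranges. The sketch below shows one way to do that with ceiling division. The function name `plan_chunks` and the 1-indexed chunk numbering are assumptions for illustration; the API above does not specify them.

```python
def plan_chunks(size_bytes: int, chunk_size_bytes: int = 5 * 1024 * 1024) -> list:
    """Return (chunk_number, start_byte, end_byte_exclusive) per chunk, 1-indexed."""
    if size_bytes <= 0 or chunk_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    total_chunks = -(-size_bytes // chunk_size_bytes)  # ceiling division on integers
    return [
        (n, (n - 1) * chunk_size_bytes, min(n * chunk_size_bytes, size_bytes))
        for n in range(1, total_chunks + 1)
    ]


# The 2 GiB example file from the initiate request, split into 5 MiB chunks.
chunks = plan_chunks(2_147_483_648)
last_number, last_start, last_end = chunks[-1]
print(len(chunks), last_end - last_start)
```

The final chunk is shorter than the rest, so the server should accept a last chunk of any size up to `chunk_size_bytes`.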
05

High-Level Architecture

Full System: Write + Read + Search Paths (diagram)
Write path: Upload Client (chunked upload over HTTPS) → Upload Service (validates, stores raw) → S3 Raw (original file) → Kafka (decouples pipeline) → Transcoding Workers (parallel DAG: 4K VP9/AV1 at 20 Mbps, 1080p H.264 at 8 Mbps, 720p H.264 at 4 Mbps, plus 480p, 360p, 144p) → S3 Renditions (all resolutions) → Metadata DB (MySQL, video marked ready), followed by an async CDN pre-warm.
Read path: Viewer / Player (ABR streaming) → API Server (returns manifest URL; reads metadata through the Redis cache of hot metadata and counts) → CDN Edge (Open Connect / PoP, 95%+ of traffic served here) → Manifest (HLS / DASH rendition index file) → Video Chunks (2–10s segments, CDN-served, fetched in the ABR loop).
Search path: Client (search query) → Search Service → Elasticsearch (retrieve) → ML Ranker (rank).
Kafka Queue

The architectural pivot between upload and transcoding. The upload service publishes an event and returns 202 Accepted immediately. Transcoding workers consume independently — decoupled, scalable, resilient to worker crashes.

Transcoding Workers

A fleet of CPU-heavy machines running FFmpeg. Parallelised in two dimensions: one worker per rendition, and the video is split into 10-second segments so each segment transcodes concurrently — a DAG of jobs.

CDN Edge

The most critical component on the read path. Netflix's Open Connect Appliances sit inside ISP data centres. YouTube uses Google's private PoP network. Over 95% of traffic never reaches origin servers.

Metadata DB + Redis

MySQL stores video metadata — title, owner, manifest URL, status. Redis caches hot video metadata and approximate view counts. View counts use Redis INCR and batch-flush to DB — exact real-time counts are unnecessary.
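The approximate view-count pattern can be sketched without any infrastructure. In the example below, plain in-memory counters stand in for both Redis and MySQL; the class and method names are hypothetical, and a real deployment would call Redis `INCR` and issue a batched SQL update instead.

```python
from collections import Counter


class ApproxViewCounter:
    """In-memory stand-in for Redis INCR plus a periodic batch flush to the DB."""

    def __init__(self):
        self.pending = Counter()   # plays the role of the Redis counters
        self.database = Counter()  # plays the role of the durable view_count column

    def record_view(self, video_id: str) -> None:
        # One cheap increment per view; nothing touches the database here.
        self.pending[video_id] += 1

    def flush(self) -> int:
        """Apply buffered increments in one batch; return how many rows were written."""
        batch, self.pending = self.pending, Counter()
        for video_id, delta in batch.items():
            self.database[video_id] += delta
        return len(batch)


counter = ApproxViewCounter()
for video_id in ["vid_a", "vid_a", "vid_a", "vid_b"]:
    counter.record_view(video_id)
rows_written = counter.flush()
print(rows_written, counter.database["vid_a"])
```

Four views become two database writes here; the trade-off is that the stored count lags by up to one flush interval.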

06

Deep Dive — The Video Encoding Pipeline

Why This Is the Hard Part

Most candidates describe the CDN and stop. But the encoding pipeline is what makes YouTube possible. Every video you watch has been automatically processed into 6+ renditions by a distributed job system. Understanding how that works — and why it's designed the way it is — separates good answers from great ones.

When a raw video arrives, three problems need solving simultaneously: resilience (uploads fail), decoupling (encoding is slow), and speed (encoding is embarrassingly parallel). Each problem has a clean solution.

One upload → six renditions

Resolution | Codec | Bitrate
4K | VP9 / AV1 | ~20 Mbps
1080p | H.264 | ~8 Mbps
720p | H.264 | ~4 Mbps
480p | H.264 | ~1.5 Mbps
360p | H.264 | ~0.7 Mbps
144p | H.264 | ~0.1 Mbps
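The six renditions are what the manifest advertises to the player. As a rough illustration, the sketch below builds an HLS master playlist from that ladder. The pixel dimensions and the `<name>/playlist.m3u8` paths are assumptions for the example; `#EXTM3U` and `#EXT-X-STREAM-INF` with `BANDWIDTH` and `RESOLUTION` are standard HLS tags.

```python
# (name, width, height, bitrate in bits per second); dimensions are assumed.
LADDER = [
    ("4k", 3840, 2160, 20_000_000),
    ("1080p", 1920, 1080, 8_000_000),
    ("720p", 1280, 720, 4_000_000),
    ("480p", 854, 480, 1_500_000),
    ("360p", 640, 360, 700_000),
    ("144p", 256, 144, 100_000),
]


def master_playlist(ladder=LADDER) -> str:
    """Build an HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for name, width, height, bps in ladder:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bps},RESOLUTION={width}x{height}")
        lines.append(f"{name}/playlist.m3u8")  # hypothetical per-rendition playlist path
    return "\n".join(lines) + "\n"


playlist = master_playlist()
print(playlist)
```

Each variant entry points at a per-rendition playlist that in turn lists the individual video segments.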
Sequence — Upload to Video Live Mermaid.js
sequenceDiagram
    participant C as Client
    participant US as Upload Service
    participant S3r as S3 Raw
    participant Q as Kafka Queue
    participant TW as Transcoding Workers
    participant S3v as S3 Renditions
    participant CDN as CDN Edge
    participant DB as Metadata DB
    C->>US: POST /initiate → upload_id
    loop Per 5MB chunk
        C->>US: PUT /chunks/:n (idempotent)
    end
    C->>US: POST /complete
    US->>S3r: Store raw file
    US->>Q: Publish { video_id, raw_path }
    US-->>C: 202 Accepted (processing)
    par Parallel transcoding (per rendition)
        Q->>TW: Consume event
        TW->>S3r: Fetch raw segments
        TW->>TW: FFmpeg transcode (DAG jobs)
        TW->>S3v: Store renditions
    end
    TW->>CDN: Pre-warm edge caches
    TW->>DB: Mark video status = ready

Problem 1 — Resilient uploads. A 2GB file can't be a single HTTP request. Networks drop. Phones sleep. The solution is chunked uploading — split the file into 5MB pieces, upload each independently with its own retry. The server tracks received chunks in Redis. A failed chunk retries without restarting the entire upload. The upload is idempotent per chunk — safe to retry any number of times.
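The server-side bookkeeping for chunked uploads is small. A minimal sketch follows, assuming 1-indexed chunk numbers and using a Python set where production code would use a Redis set keyed by `upload_id`; the class name `UploadSession` is hypothetical.

```python
class UploadSession:
    """Tracks which chunks of one upload have arrived."""

    def __init__(self, upload_id: str, total_chunks: int):
        self.upload_id = upload_id
        self.total_chunks = total_chunks
        self.received = set()  # stand-in for a Redis set keyed by upload_id

    def put_chunk(self, n: int) -> dict:
        if not 1 <= n <= self.total_chunks:
            raise ValueError(f"chunk {n} out of range")
        self.received.add(n)  # adding the same chunk twice is a no-op, so retries are safe
        return {
            "chunk": n,
            "received": True,
            "chunks_remaining": self.total_chunks - len(self.received),
        }

    def missing(self) -> list:
        """Chunks the client still has to send, used to resume after a failure."""
        return [n for n in range(1, self.total_chunks + 1) if n not in self.received]


session = UploadSession("upl_abc123", total_chunks=5)
session.put_chunk(1)
session.put_chunk(2)
retry = session.put_chunk(2)  # duplicate delivery after a client retry
session.put_chunk(4)
print(retry["chunks_remaining"], session.missing())
```

Because the state lives outside any single server process, a retry can land on a different instance and still see the same set of received chunks.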

Problem 2 — Decoupled transcoding. Encoding takes minutes, so the upload service cannot block waiting for it. It does two things: stores the raw file in S3, then publishes an event to Kafka. It then returns 202 Accepted and its part is done. Transcoding workers consume the topic independently, at their own pace, on their own machines. If a worker crashes before committing its offset, the event is still in the Kafka log and another consumer in the group picks it up. The two services are completely decoupled.
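The publish-and-return shape can be shown with the standard library alone. In this sketch a `queue.Queue` stands in for the Kafka topic and a thread stands in for a separate worker machine; the function names and the `None` stop marker are assumptions made for the example.

```python
import queue
import threading

transcode_queue = queue.Queue()  # in-process stand-in for a Kafka topic


def complete_upload(video_id: str, raw_path: str):
    """Upload service: publish an event and return immediately with 202."""
    transcode_queue.put({"video_id": video_id, "raw_path": raw_path})
    return 202, {"video_id": video_id, "status": "processing"}


def transcoding_worker(done: list) -> None:
    """Worker: consume events at its own pace until it sees the stop marker."""
    while True:
        event = transcode_queue.get()
        if event is None:  # stop marker used only by this sketch
            break
        done.append(event["video_id"])  # a real worker would run FFmpeg here


status, body = complete_upload("vid_xyz789", "raw/vid_xyz789.mov")

done = []
worker = threading.Thread(target=transcoding_worker, args=(done,))
worker.start()
transcode_queue.put(None)
worker.join()
print(status, body["status"], done)
```

The upload call returns before any worker exists, which is the property that matters: ingestion latency does not depend on encoding capacity.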

Problem 3 — Parallel DAG jobs. A naïve transcoder processes renditions sequentially. But the 1080p job has no dependency on the 4K job. And the first minute of a video has no dependency on the last. So the pipeline splits across two dimensions: one worker per rendition (fan-out by format), and the video is first split into 10-second segments (fan-out by time). With enough workers, a 2-hour video transcodes in minutes, not hours. This DAG approach is what lets YouTube make a video watchable within ~5 minutes of upload.
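The two-dimensional fan-out can be made concrete by enumerating the jobs. The sketch below plans one job per (rendition, segment) pair and gives a crude wall-clock estimate; the 30 seconds per job and the worker counts are illustrative assumptions, and the estimate ignores the final stitching step and scheduling overhead.

```python
import math
from itertools import product

RENDITIONS = ["4k", "1080p", "720p", "480p", "360p", "144p"]


def plan_jobs(duration_s: int, segment_s: int = 10, renditions=RENDITIONS) -> list:
    """One independent job per (rendition, segment index) pair."""
    segments = math.ceil(duration_s / segment_s)
    return list(product(renditions, range(segments)))


def estimated_wall_clock_s(job_count: int, workers: int, seconds_per_job: float) -> float:
    """Crude estimate: jobs run in waves of `workers` at a time."""
    return math.ceil(job_count / workers) * seconds_per_job


jobs = plan_jobs(duration_s=2 * 60 * 60)  # a 2-hour video
serial = estimated_wall_clock_s(len(jobs), workers=1, seconds_per_job=30)
parallel = estimated_wall_clock_s(len(jobs), workers=1000, seconds_per_job=30)
print(len(jobs), serial / 3600, parallel / 60)
```

Under these assumed numbers, 4,320 jobs take 36 hours on one worker and 2.5 minutes on a thousand, which is the gap the paragraph above describes.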

Netflix's edge: Netflix has weeks before a title goes live. They run a complexity analysis pass first — measuring how visually complex each scene is. A dark action sequence needs more bits. A static talking head needs fewer. They then set a per-title bitrate ladder — a custom encoding profile for each show. Same perceived quality at lower bandwidth. This is only possible because they control the intake timeline.

Request Flow — Step Through
Client · Chunked upload → Upload Service · 202 Accepted → Kafka Queue · Event published → Segment Split · 10s chunks → Workers ×N · Parallel FFmpeg → S3 Renditions · 6 formats stored → CDN Pre-warm · Edges populated → Video Live · Status = ready
07

Key Design Decisions & Tradeoffs

Option A — Chosen
Async transcoding via queue

Upload service publishes to Kafka and returns immediately. Workers consume independently. Resilient to worker failures. Scales to any upload volume by adding workers. Enables parallel segment-level processing.

✓ Required at YouTube's scale
Option B
Synchronous transcoding

Upload service waits for encoding to finish before returning. Simple to reason about. But a single slow 4K upload blocks a thread for 10+ minutes. One traffic spike saturates the upload service. Does not scale past hundreds of uploads/hour.

✗ Fails under any real load
Option A — Chosen (Netflix)
Proactive CDN cache warming

Push popular content to edge caches before requests arrive. No cold start. First viewer gets the same fast experience as the millionth. Netflix pre-warms Open Connect Appliances during off-peak hours with predicted popular content.

✓ Eliminates cold-start latency spikes
Option B
Pull-based CDN caching

Standard CDN behaviour — cache on first request, serve subsequent requests from cache. Simple. No infrastructure overhead. But the first user in a region after a cache miss hits origin — potentially slow and expensive for a viral video.

~ Fine for small-scale deployments
Option A — Chosen
Chunked resumable upload

Split uploads into 5MB chunks. Each chunk is idempotent and safe to retry. Server tracks progress in Redis. A 2GB upload that fails at 1.8GB resumes from roughly chunk 360, not from zero. Works on any network quality.

✓ Standard for any large file upload
Option B
Single-stream upload

One HTTP connection for the entire file. Simplest to implement. But any network interruption means restarting from zero. A user on mobile uploading a 2GB video has a near-zero chance of success on a single stream.

✗ Unusable on mobile / poor connections
Option A — Chosen (at scale)
Erasure coding for cold storage

Split file into data + parity chunks, distributed across disks. Can reconstruct from partial data. ~1.5× storage overhead vs 3× for replication. At exabyte scale, the difference between 1.5 EB and 3 EB is billions of dollars.

✓ Netflix + YouTube both use this
Option B
3× replication

Store three identical copies on separate disks/nodes. Simple, fast reads, well-understood failure model. But 3× storage cost is prohibitive at exabyte scale. Fine for small deployments or hot-tier data.

~ Start here, migrate later
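To see where the 1.5× versus 3× figures come from, and how parity allows reconstruction, here is a small sketch. It uses a single XOR parity shard, the simplest possible erasure code, which tolerates only one lost shard; production systems typically use Reed-Solomon codes with several parity shards. The 6 data + 3 parity layout is an assumed example that happens to give 1.5×.

```python
def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Bytes stored per byte of user data for a (data + parity) erasure code."""
    return (data_shards + parity_shards) / data_shards


def xor_parity(shards: list) -> bytes:
    """XOR equal-length shards together byte by byte."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_shards: list, parity: bytes) -> bytes:
    """Rebuild the single missing data shard from the survivors plus parity."""
    return xor_parity(surviving_shards + [parity])


data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)
rebuilt = recover_missing([data[0], data[2]], parity)  # shard 1 was lost
print(storage_overhead(6, 3), rebuilt)
```

Three-way replication stores every byte three times; the erasure-coded layout stores each byte once plus a share of the parity, at the cost of extra reads and CPU during reconstruction.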
08

What Can Go Wrong

🔥
Transcoding Worker Crash Mid-Job

A worker dies halfway through encoding a 4K rendition. Without idempotency, you'd have partial files in S3 and a corrupted video. The worker consumed the Kafka message but never committed its offset, so after the consumer group rebalances, another worker reads the same message and re-processes the job from scratch.

→ Fix: Idempotent jobs — each 10s segment is a separate job with unique ID. Failed segments retry independently. Kafka's at-least-once delivery ensures no message is lost.
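One way to make segment jobs idempotent is to derive both the job ID and the output location purely from the job's inputs, so a redelivered message overwrites the same object instead of creating a second one. The sketch below assumes that approach; the key layout, the `.ts` extension, and the dict standing in for object storage are all hypothetical.

```python
import hashlib


def segment_job_id(video_id: str, rendition: str, segment_index: int) -> str:
    """Deterministic ID: the same inputs always produce the same job ID."""
    raw = f"{video_id}:{rendition}:{segment_index}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def output_key(video_id: str, rendition: str, segment_index: int) -> str:
    """Deterministic storage key, so a retry writes to the same place."""
    return f"{video_id}/{rendition}/seg_{segment_index:05d}.ts"


def run_segment_job(store: dict, video_id: str, rendition: str, segment_index: int, encode) -> str:
    key = output_key(video_id, rendition, segment_index)
    store[key] = encode(video_id, rendition, segment_index)  # overwrite is harmless on redelivery
    return key


def fake_encode(video_id, rendition, segment_index) -> bytes:
    return f"{video_id}-{rendition}-{segment_index}".encode()  # placeholder for FFmpeg output


store = {}
first_key = run_segment_job(store, "vid_xyz789", "720p", 3, fake_encode)
second_key = run_segment_job(store, "vid_xyz789", "720p", 3, fake_encode)  # simulated redelivery
print(first_key, len(store))
```

Running the job twice leaves exactly one object, which is what makes at-least-once delivery safe to rely on.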
🌊
Viral Video CDN Cache Miss Storm

A video goes viral. Millions of viewers hit play simultaneously in a region where the CDN hasn't cached the content yet. All requests fall back to origin — a thundering herd that overwhelms the origin server.

→ Fix: Pre-warm CDN edges immediately after transcoding completes. For uploads gaining rapid traction, a watch-count threshold triggers additional pre-warming. CDN request coalescing ensures only one origin fetch per cache miss, not one per user.
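Request coalescing is sometimes called single-flight: the first request for a missing key fetches from origin, and concurrent requests for the same key wait for that result. The following is a minimal threaded sketch of the idea, not how any particular CDN implements it; the class name and the error behaviour are assumptions.

```python
import threading


class CoalescingCache:
    """At most one origin fetch per key, however many callers miss at once."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._cache = {}
        self._inflight = {}  # key -> Event set when the leader's fetch finishes
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event
        if leader:
            try:
                value = self._fetch(key)  # the only origin request for this key
                with self._lock:
                    self._cache[key] = value
                return value
            finally:
                with self._lock:
                    del self._inflight[key]
                event.set()
        event.wait()
        with self._lock:
            return self._cache[key]  # raises KeyError if the leader's fetch failed


origin_calls = []


def origin(key):
    origin_calls.append(key)
    return f"bytes:{key}"


cache = CoalescingCache(origin)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("seg_0"))) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(origin_calls), len(results))
```

Twenty concurrent viewers produce one origin fetch. A production version would also need expiry and a retry path for followers when the leader fails.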
📴
Upload Service Crash Mid-Chunk

User is uploading chunk 47 of 200. The upload service instance dies. Without stateful tracking, the client has no idea which chunks were received and must restart the entire upload.

→ Fix: Chunk state stored in Redis (not in-process memory). Client retries against any healthy instance — the state is in Redis, not the server. Client can query which chunks are missing and resume exactly.
Metadata DB Overload on Viral Content

A video goes viral — 10 million simultaneous viewers, each triggering a metadata read. The DB gets crushed. Latency spikes. Errors cascade.

→ Fix: Redis cache in front of metadata DB for hot videos. View counts use approximate counters (Redis INCR, batch-flush every 30s) — exact real-time counts aren't needed. Hot videos get a "pinned" flag in Redis that bypasses DB entirely.
🕐
Kafka Consumer Lag

A sudden spike in uploads (live event, viral moment) floods the Kafka topic. Transcoding workers can't keep up. Queue depth grows. Videos take hours instead of minutes to become watchable.

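→ Fix: Auto-scale the transcoding worker fleet based on queue depth (consumer lag). Kafka's consumer group model means adding workers is just deploying more instances, with partitions rebalanced automatically. Parallelism is capped by the topic's partition count, so provision enough partitions for peak load. Prioritise short videos (faster to process) over long ones during load spikes.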
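A queue-depth autoscaling rule can be as simple as the function below. It is a sketch under assumed inputs: the per-worker throughput, the target drain time, and the partition count are all hypothetical tuning values. The cap reflects the fact that in a Kafka consumer group, consumers beyond the partition count sit idle.

```python
import math


def desired_workers(
    queue_depth: int,
    jobs_per_worker_per_min: float,
    target_drain_min: float,
    partitions: int,
    min_workers: int = 1,
) -> int:
    """Workers needed to drain the backlog in the target time, capped by partitions."""
    needed = math.ceil(queue_depth / (jobs_per_worker_per_min * target_drain_min))
    return max(min_workers, min(needed, partitions))


quiet = desired_workers(queue_depth=100, jobs_per_worker_per_min=2, target_drain_min=10, partitions=500)
spike = desired_workers(queue_depth=12_000, jobs_per_worker_per_min=2, target_drain_min=10, partitions=500)
print(quiet, spike)
```

During the spike the rule asks for 600 workers but returns 500, a signal that the partition count, not the fleet size, has become the bottleneck.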

Anti-patterns

🚫
Stream directly from S3

Origin bandwidth bills + latency kill the playback.

✓ Better: Multi-CDN edge caching; HLS segments served from edges; origin is last resort.
🚫
Transcode on playback request

Transcoding compute per request is absurd; viewers see a spinner for minutes.

✓ Better: Pre-transcode all renditions on upload; manifest points at pre-built segments.
🚫
Stream a single 1080p file to all devices

Mobile on 3G can't handle 1080p; desktop wants 4K.

✓ Better: Adaptive bitrate (HLS/DASH); player chooses rendition per-segment based on bandwidth.
09

Interview Tips

01

Draw the write path and read path separately — from the start. This is the single most clarifying thing you can do. Say: "I'll design the upload pipeline first, then the watch path — they're completely separate systems." Candidates who conflate them end up drawing confusing spaghetti diagrams.

02

The queue between upload and transcoding is the pivotal decision. State it proactively: "We can't synchronously encode video — it takes minutes. So the upload service publishes to Kafka and returns 202 immediately. This decouples ingestion from processing." If you don't say this, the interviewer will probe until you get there. Get there first.

03

Name the CDN strategy specifically. Don't say "we'll use a CDN." Say: "Netflix uses Open Connect Appliances — physical servers inside ISPs, pre-warmed with popular content. YouTube uses Google's own PoP network. Both serve 95%+ of traffic without ever hitting origin." Specificity signals depth.

04

Know adaptive bitrate streaming by name. The player doesn't pick one quality and stick to it. It requests 2–10 second chunks independently, measures download speed between each chunk, and picks the best rendition for the next one. Know HLS and DASH by name — they're the two dominant standards.
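The selection loop in this tip can be sketched as a throughput-based rule: measure how fast the last segment downloaded, then pick the highest rendition that fits within a safety margin. The ladder values come from the rendition table earlier in this guide; the 0.8 safety factor and function names are assumptions, and real players usually weigh buffer level as well.

```python
# (rendition, bitrate in kbps), lowest first.
LADDER_KBPS = [
    ("144p", 100),
    ("360p", 700),
    ("480p", 1500),
    ("720p", 4000),
    ("1080p", 8000),
    ("4k", 20000),
]


def throughput_kbps(segment_bytes: int, download_s: float) -> float:
    """Measured throughput of the last segment download."""
    return segment_bytes * 8 / 1000 / download_s


def pick_rendition(measured_kbps: float, safety: float = 0.8, ladder=LADDER_KBPS) -> str:
    """Highest rendition whose bitrate fits in the safety-adjusted budget."""
    budget = measured_kbps * safety
    choice = ladder[0][0]  # always fall back to the lowest rendition
    for name, kbps in ladder:
        if kbps <= budget:
            choice = name
    return choice


measured = throughput_kbps(segment_bytes=1_000_000, download_s=2.0)
print(measured, pick_rendition(measured), pick_rendition(6000), pick_rendition(50))
```

The player repeats this decision before every segment request, which is why quality can step up or down mid-playback without restarting the stream.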

05

Common follow-up: "How do you handle a video going viral 5 minutes after upload?" Answer: immediately after transcoding completes, push the manifest and first few chunks to CDN edges proactively — before the traffic wave hits. This is the pre-warming pattern. Without it, the first million viewers all get a slow cache-miss experience simultaneously.

06

YouTube vs Netflix is a strong signal of depth. Most candidates describe one system. Interviewers love when you distinguish them: "YouTube's hard problem is the write side — reactive transcoding at scale. Netflix's hard problem is the read side — per-title encoding and Open Connect delivery." This shows architectural thinking, not just pattern matching.

10

Similar Problems

The patterns here — async pipeline via queue, CDN-first delivery, chunked upload — appear in many other problems. Mastering this one gives you a significant head start.

11

How the Design Evolves

Phase 1 — 0 to 10K users
Synchronous encoding, single server

Upload to S3, trigger a single FFmpeg process, transcode to 2–3 renditions. No queue, no workers. Ship fast — get the product in front of users. Most startups that build video features live here forever. Synchronous is fine when uploads are rare.

Phase 2 — 10K to 1M users
Add queue + worker pool + CDN

The moment uploads start failing or taking too long, introduce SQS or Kafka and a pool of transcoding workers. Add a third-party CDN (Cloudflare, Akamai) for video delivery. Add Redis for metadata caching. Many growing video products settle here. The architecture that handles 1M users is usually sufficient for 10M.

Phase 3 — 1M to 100M users
Segment-level parallelism + multi-region

Introduce DAG-based transcoding — split videos into segments, transcode in parallel. Multi-region deployment. CDN cache warming based on engagement signals. Separate microservices for upload, transcoding, delivery, and search. A dedicated recommendations engine. Auto-scaling worker fleet based on queue depth.

Phase 4 — 100M+ users
Per-title encoding + embedded CDN hardware

Netflix-level: Open Connect Appliances inside ISPs. Per-title encoding with complexity analysis. Erasure coding for cold storage at exabyte scale. ML-driven predictive cache population. Custom ABR algorithms tuned per device class and network type. Most engineers never operate here — but knowing it demonstrates architectural depth in interviews.
