06
Deep Dive — The Video Encoding Pipeline
Why This Is the Hard Part
Most candidates describe the CDN and stop. But the encoding pipeline is what makes YouTube possible. Every video you watch has been automatically processed into 6+ renditions by a distributed job system. Understanding how that works — and why it's designed the way it is — separates good answers from great ones.
When a raw video arrives, three problems need solving at once: resilience (uploads fail partway through), decoupling (encoding is too slow to run inside the upload request), and speed (encoding is embarrassingly parallel, so running it serially wastes time). Each problem has a clean solution.
One upload → six renditions
Sequence — Upload to Video Live
Mermaid.js
sequenceDiagram
    participant C as Client
    participant US as Upload Service
    participant S3r as S3 Raw
    participant Q as Kafka Queue
    participant TW as Transcoding Workers
    participant S3v as S3 Renditions
    participant CDN as CDN Edge
    participant DB as Metadata DB
    C->>US: POST /initiate → upload_id
    loop Per 5MB chunk
        C->>US: PUT /chunks/:n (idempotent)
    end
    C->>US: POST /complete
    US->>S3r: Store raw file
    US->>Q: Publish { video_id, raw_path }
    US-->>C: 202 Accepted — processing
    par Parallel transcoding (per rendition)
        Q->>TW: Consume event
        TW->>S3r: Fetch raw segments
        TW->>TW: FFmpeg transcode (DAG jobs)
        TW->>S3v: Store renditions
    end
    TW->>CDN: Pre-warm edge caches
    TW->>DB: Mark video status = ready
Problem 1 — Resilient uploads. A 2GB file shouldn't travel as a single HTTP request: networks drop, phones sleep, and any failure would restart the transfer from zero. The solution is chunked uploading: split the file into 5MB pieces and upload each one independently, with its own retry. The server tracks received chunks in Redis, so a failed chunk retries without restarting the entire upload. Each chunk PUT is idempotent, so it is safe to retry any number of times.
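The chunk bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the production design: a plain dict stands in for Redis and chunk storage, and the names `UploadSession`, `put_chunk`, `missing`, and `complete` are hypothetical.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the chunk size used in the text


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a payload into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


@dataclass
class UploadSession:
    """Server-side state for one upload_id (a dict stands in for Redis)."""
    total_chunks: int
    received: dict[int, bytes] = field(default_factory=dict)

    def put_chunk(self, n: int, body: bytes) -> bool:
        """Idempotent PUT /chunks/:n. Returns True only if the chunk is new."""
        if not 0 <= n < self.total_chunks:
            raise ValueError(f"chunk index {n} out of range")
        if n in self.received:
            return False  # a retry of a chunk we already hold is a no-op
        self.received[n] = body
        return True

    def missing(self) -> list[int]:
        """Chunk indices the client still needs to send."""
        return [n for n in range(self.total_chunks) if n not in self.received]

    def complete(self) -> bytes:
        """POST /complete: reassemble only when every chunk has arrived."""
        if self.missing():
            raise RuntimeError(f"missing chunks: {self.missing()}")
        return b"".join(self.received[n] for n in range(self.total_chunks))
```

Because `put_chunk` ignores duplicates, a client that is unsure whether a request succeeded can simply resend it, and `missing()` tells a resuming client exactly which chunks remain.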
Problem 2 — Decoupled transcoding. Encoding takes minutes, so the upload service can't block waiting for it. On completion it does two things: stores the raw file in S3, then publishes an event to Kafka. That's it: it returns 202 Accepted and walks away. Transcoding workers consume the queue independently, at their own pace, on their own machines. If a worker crashes mid-job, the event is still in Kafka with its offset uncommitted, so another worker picks it up and retries. Neither service calls the other or waits on it; they share only the event and the raw file in S3.
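A toy model makes the hand-off and the crash-retry behaviour concrete. This sketch assumes nothing about a real Kafka client: `EventLog` is a hypothetical in-memory stand-in for a single partition with one committed offset, and `complete_upload` and `run_worker_once` are illustrative names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class EventLog:
    """Toy stand-in for one Kafka partition: append-only list + committed offset."""
    events: list = field(default_factory=list)
    committed: int = 0  # index of the first event not yet fully processed

    def publish(self, event: dict) -> None:
        self.events.append(event)

    def poll(self) -> Optional[dict]:
        """Return the next uncommitted event, or None if the log is drained."""
        if self.committed < len(self.events):
            return self.events[self.committed]
        return None

    def commit(self) -> None:
        self.committed += 1


def complete_upload(video_id: str, raw: bytes, raw_store: dict, log: EventLog) -> int:
    """Handler for POST /complete: store the raw file, publish, return at once."""
    raw_path = f"raw/{video_id}"
    raw_store[raw_path] = raw  # stands in for the S3 write
    log.publish({"video_id": video_id, "raw_path": raw_path})
    return 202  # Accepted: processing happens elsewhere


def run_worker_once(log: EventLog, transcode: Callable[[dict], None]) -> bool:
    """Process one event; the offset is committed only after transcode succeeds."""
    event = log.poll()
    if event is None:
        return False
    transcode(event)  # if this raises, commit() is skipped and the event is re-read
    log.commit()
    return True
```

The ordering inside `run_worker_once` is the point: committing after the work gives at-least-once delivery, so the transcode step itself should be safe to repeat.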
Problem 3 — Parallel DAG jobs. A naïve transcoder processes renditions sequentially. But the 1080p job has no dependency on the 4K job, and the first minute of a video has no dependency on the last. So the pipeline fans out across two dimensions: the video is first split into 10-second segments (fan-out by time), and each segment is then transcoded once per rendition by a separate worker (fan-out by format). The jobs form a directed acyclic graph (DAG): the split runs first, the segment transcodes run independently of one another, and a rendition is finalized only when all of its segments are done. With enough workers, a 2-hour video transcodes in minutes, not hours. This DAG approach is what lets YouTube make a video watchable within ~5 minutes of upload.
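The two-dimensional fan-out can be sketched as a job planner plus a worker pool. Everything here is illustrative: the rendition names are assumed, `transcode_segment` returns a placeholder path instead of invoking FFmpeg, and a thread pool stands in for a fleet of worker machines.

```python
import math
from concurrent.futures import ThreadPoolExecutor

RENDITIONS = ["240p", "360p", "480p", "720p", "1080p", "4K"]  # assumed names
SEGMENT_SECONDS = 10  # segment length from the text


def plan_jobs(duration_s: float, renditions=RENDITIONS, segment_s=SEGMENT_SECONDS):
    """Fan out by time and by format: one job per (rendition, segment index)."""
    n_segments = math.ceil(duration_s / segment_s)
    return [(r, i) for r in renditions for i in range(n_segments)]


def transcode_segment(job):
    """Placeholder for the real FFmpeg call; returns where the output would go."""
    rendition, index = job
    return rendition, index, f"{rendition}/seg-{index:05d}.ts"


def run_pipeline(duration_s: float, max_workers: int = 8) -> dict:
    """Run every job in parallel, then assemble an ordered playlist per rendition."""
    jobs = plan_jobs(duration_s)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(transcode_segment, jobs))
    playlists = {}
    # The join step of the DAG: it runs only after all segment jobs have finished.
    for rendition, index, path in sorted(results):
        playlists.setdefault(rendition, []).append(path)
    return playlists
```

For a 2-hour video, `plan_jobs(7200)` yields 720 segments × 6 renditions = 4,320 independent jobs, which is why adding workers shortens wall-clock time almost linearly until the split and join steps dominate.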
Netflix's edge: Netflix has weeks before a title goes live. They run a complexity analysis pass first — measuring how visually complex each scene is. A dark action sequence needs more bits. A static talking head needs fewer. They then set a per-title bitrate ladder — a custom encoding profile for each show. Same perceived quality at lower bandwidth. This is only possible because they control the intake timeline.
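To make the idea of a per-title ladder concrete, here is a deliberately simplified sketch. It is not Netflix's algorithm: the base bitrates, the 0.0 to 1.0 complexity score, and the linear scaling rule are all assumptions chosen only to show a ladder shifting up or down with content complexity.

```python
# Illustrative base ladder in kbps; these numbers are assumptions, not Netflix's.
BASE_LADDER_KBPS = {"480p": 1500, "720p": 3000, "1080p": 6000}


def per_title_ladder(scene_complexities, base=BASE_LADDER_KBPS,
                     min_scale=0.5, max_scale=1.5):
    """Scale a base ladder by mean scene complexity.

    scene_complexities: one score per scene, 0.0 (static) to 1.0 (very complex).
    A mean of 0.5 reproduces the base ladder; lower means spend fewer bits.
    """
    if not scene_complexities:
        raise ValueError("need at least one scene score")
    mean = sum(scene_complexities) / len(scene_complexities)
    factor = min_scale + (max_scale - min_scale) * mean
    return {name: round(kbps * factor) for name, kbps in base.items()}
```

Under these assumptions a static talking-head title with scores near 0.0 would get roughly half the base bitrate at every rung, while a visually dense title with scores near 1.0 would get about 1.5 times as much.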
Client (chunked upload) → Upload Service (202 Accepted) → Kafka Queue (event published) → Segment Split (10s chunks) → Workers ×N (parallel FFmpeg) → S3 Renditions (6 formats stored) → CDN Pre-warm (edges populated) → Video Live (status = ready)