Video Conferencing

01

Problem Statement

Design a real-time video conferencing system that supports one-on-one calls, group meetings of up to 49 active video participants, and large webinars with up to 1,000 audio participants. The system must deliver sub-300ms mouth-to-ear latency, gracefully adapt to varying network conditions, and scale to 300M+ daily participants globally.

Unlike typical request-response web systems, video conferencing is a continuous bidirectional real-time media pipeline. You're not serving web pages; you're routing live media packets, with a new audio frame roughly every 20ms and a new video frame every 33ms at 30fps. The constraints are physics-level: speed of light, codec latency, and jitter buffers.

Core question: How do you deliver real-time audio/video to hundreds of participants across the globe with sub-300ms latency, while gracefully adapting to wildly varying network conditions?

02

Requirements

Functional Requirements

1:1 and group video calls — up to 49 active video, 1,000 audio participants
Screen sharing — additional media stream from any participant
Meeting management — create, schedule, join via link, waiting rooms, host controls (mute, kick, admit)
In-meeting chat — text alongside the call
Recording — server-side recording with post-meeting compositing
Reactions & hand-raise — lightweight signaling alongside media

Non-Functional Requirements

Ultra-low latency — end-to-end <300ms mouth-to-ear (200ms ideal); beyond 400ms conversation breaks
High availability — 99.99% uptime; conferencing outages are immediately visible
Adaptive quality — graceful degradation on poor networks rather than dropping the call
Global scale — 300M+ daily participants across every continent
Security — transport encryption default, E2EE optional; meeting access controls

03

Scale Estimation

Grounded in Zoom-scale numbers: 300M daily participants in meetings that average 40 minutes and 6 participants, so 300M / 6 ≈ 50M meetings/day.

Metric	Value
Daily Participants	300M
Peak Concurrent	~50M
Peak Meetings Started	~1,700/s
Peak Bandwidth Ingest	~80 Tbps
New Recordings	~500 TB/day
Signaling Events	~5M/s

What the Numbers Tell Us

The bandwidth numbers are the entire story. Roughly 50M concurrent participants, each sending on the order of 1-2 Mbps, add up to about 80 Tbps of ingest, and you cannot route that through any single data center. This forces a massively distributed edge architecture: media servers in every major metro, with streams never touching a central location unless cascading is needed. The system is closer to a CDN than a traditional web app.
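The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch, not a capacity model: the peak-to-average ratio and the per-participant send rate are assumptions chosen for illustration, while the participant counts come from the text.

```python
# Back-of-envelope scale arithmetic. Values marked "assumed" are not from the text.
DAILY_PARTICIPANTS = 300_000_000   # from the text
PARTICIPANTS_PER_MEETING = 6       # from the text
PEAK_CONCURRENT = 50_000_000       # from the text
SECONDS_PER_DAY = 86_400
PEAK_TO_AVERAGE = 3.0              # assumed peak-to-average ratio
AVG_UPLINK_MBPS = 1.6              # assumed average send rate per participant


def meetings_per_day() -> float:
    return DAILY_PARTICIPANTS / PARTICIPANTS_PER_MEETING


def peak_meeting_starts_per_second() -> float:
    average = meetings_per_day() / SECONDS_PER_DAY
    return average * PEAK_TO_AVERAGE


def peak_ingest_tbps() -> float:
    # 1 Tbps = 1,000,000 Mbps
    return PEAK_CONCURRENT * AVG_UPLINK_MBPS / 1_000_000


if __name__ == "__main__":
    print(f"meetings/day:       {meetings_per_day():,.0f}")
    print(f"peak starts/s:      {peak_meeting_starts_per_second():,.0f}")
    print(f"peak ingest (Tbps): {peak_ingest_tbps():,.0f}")
```

Changing either assumed constant moves the result linearly, which is a quick way to sanity-check a number quoted in an interview.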

04

API Design

Two distinct API surfaces: a REST API for meeting lifecycle and a WebSocket signaling protocol for in-call control.

A) Meeting Lifecycle — REST

Create Meeting

POST /api/v1/meetings
Authorization: Bearer <token>

{
  "title": "Sprint Planning",
  "type": "scheduled",
  "start_time": "2026-04-08T10:00:00Z",
  "duration_minutes": 60,
  "settings": {
    "max_participants": 100,
    "waiting_room": true,
    "mute_on_entry": true,
    "allow_recording": true,
    "e2ee_enabled": false
  }
}

→ 201: { "meeting_id": "m_8f3k29x", "join_url": "https://meet.example.com/j/8f3k29x" }

Join Meeting — Get Connection Details

POST /api/v1/meetings/{meeting_id}/join
→ 200: {
  "participant_id": "p_29fk3m",
  "session_token": "st_...",
  "media_server": {
    "url": "wss://edge-dubai-01.media.example.com",
    "region": "me-south-1",
    "ice_servers": [
      { "urls": "stun:stun.example.com:3478" },
      { "urls": "turn:turn-dubai.example.com:443", "username": "...", "credential": "..." }
    ]
  }
}

Key: The join response returns the nearest media server and ICE/TURN credentials. The client doesn't pick a server — the backend does geo-routing.
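One way to picture the backend's geo-routing is a nearest-healthy-PoP lookup by great-circle distance. This is only a sketch: the `Pop` type, the PoP names and coordinates, and the `pick_media_server` helper are hypothetical, and a production router would more likely weigh measured latency and server load than raw distance.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pop:
    """Hypothetical PoP record; names and coordinates are illustrative."""
    name: str
    lat: float
    lon: float
    healthy: bool = True


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pick_media_server(client_lat: float, client_lon: float, pops: list[Pop]) -> Pop:
    """Return the closest healthy PoP to the client."""
    candidates = [p for p in pops if p.healthy]
    if not candidates:
        raise RuntimeError("no healthy PoP available")
    return min(candidates, key=lambda p: haversine_km(client_lat, client_lon, p.lat, p.lon))


POPS = [
    Pop("edge-dubai-01", 25.20, 55.27),
    Pop("edge-london-01", 51.51, -0.13),
    Pop("edge-mumbai-01", 19.08, 72.88),
]

if __name__ == "__main__":
    # A client near Abu Dhabi (approximate coordinates).
    print(pick_media_server(24.45, 54.38, POPS).name)
```

Filtering on `healthy` before taking the minimum is what lets the same lookup double as a failover path when a PoP is drained.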

B) Real-Time Signaling — WebSocket

Once joined, all in-meeting communication flows over a persistent WebSocket — bidirectional JSON messages for publishing tracks (with simulcast layers), subscribing to other participants, muting, reactions, chat, SDP exchange, and ICE candidate trickle.
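A minimal sketch of what such a JSON envelope could look like follows. The field names (`type`, `participant_id`, `seq`, `payload`) and the set of message types are assumptions made for illustration; the text does not specify a wire format.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical message types, loosely following the list in the text.
ALLOWED_TYPES = {
    "publish", "subscribe", "mute", "reaction",
    "chat", "sdp_offer", "sdp_answer", "ice_candidate",
}


@dataclass
class SignalMessage:
    type: str
    participant_id: str
    seq: int                      # per-connection sequence number for ordering
    payload: dict = field(default_factory=dict)


def encode(msg: SignalMessage) -> str:
    """Serialize a message to compact JSON, rejecting unknown types."""
    if msg.type not in ALLOWED_TYPES:
        raise ValueError(f"unknown message type: {msg.type}")
    return json.dumps(asdict(msg), separators=(",", ":"))


def decode(raw: str) -> SignalMessage:
    """Parse and validate a JSON message received from the socket."""
    data = json.loads(raw)
    if data.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unknown message type: {data.get('type')}")
    return SignalMessage(
        type=data["type"],
        participant_id=data["participant_id"],
        seq=int(data["seq"]),
        payload=data.get("payload", {}),
    )


if __name__ == "__main__":
    msg = SignalMessage("mute", "p_29fk3m", 7, {"track": "audio", "muted": True})
    print(encode(msg))
```

Carrying a sequence number in every message gives the receiver a cheap way to detect gaps and reordering after a reconnect.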

C) SDP Exchange

The WebRTC handshake uses SDP offer/answer for codec negotiation, encryption fingerprint exchange, and simulcast layer declaration. ICE candidates are trickled asynchronously as they're discovered.

05

High-Level Architecture

Video conferencing has three fundamentally different planes that scale independently — the key insight that separates this from typical web architecture.

Control Plane

REST APIs, meeting CRUD, auth, scheduling. Standard web backend — stateless services, PostgreSQL, Redis. The boring (but necessary) stuff.

Signaling Plane

Real-time meeting state over WebSocket. Who's in the call, mute states, SDP negotiation, chat. Stateful, low-bandwidth, needs reliability and ordering.

Media Plane

Actual audio/video packets over UDP/RTP. This is a real-time CDN — 99% of infrastructure cost and complexity. SFU servers at 80+ edge PoPs globally.

Capacity at Peak

Component	Scale
Media Edge Servers (SFU)	50,000–100,000 across 80+ PoPs
Signaling Servers	2,000–5,000
TURN Relays	5,000–10,000
Recording Workers (GPU)	2,000–5,000 burst
Control Plane	500–1,000 (standard web tier)

Request Flow — Step Through

Client → GeoDNS → Control Plane → Signaling → ICE/STUN → SFU → Media Flows


06

Deep Dives — 10 Rabbit Holes

Video conferencing is uniquely rich in deep technical domains. Each deep dive covers the single most interesting aspect of its subsystem.

#1 SFU vs MCU vs Mesh: The architectural backbone
#2 WebRTC Signaling: ICE, SDP, NAT traversal
#3 Adaptive Bitrate: GCC, TWCC, congestion control
#4 Audio Pipeline: AEC, mixing, jitter buffers
#5 E2E Encryption: MLS, SFrame, trust models
#6 Global Cascading: PoP placement, backbone
#7 Recording Pipeline: Capture, composite, GPU
#8 Scalable Signaling: Meeting state at 300M
#9 Codec Selection: VP9, AV1, H.264 tradeoffs
#10 Packet Loss Recovery: FEC, NACK, concealment

06.1

SFU vs MCU vs Mesh

The single most important architectural decision in video conferencing. Every other choice cascades from it.

Factor	Mesh	MCU	SFU
Server CPU	None	Extreme	Minimal
Client upload	N-1 streams	1 stream	1 stream (3 layers)
Added latency	~0ms	100-200ms	1-5ms
Per-receiver quality	No	No	Yes (simulcast)
E2EE possible	Yes	No	Yes
Max video participants	3-4	50-100	49 video / 1000 audio
Cost at scale	$0	$$$$$$	$$

SFU never decodes or encodes video — it forwards encrypted RTP packets at the packet level. Combined with simulcast (sender encodes 3 quality layers: 720p/360p/180p), each receiver gets quality-optimized per their bandwidth. The SFU makes per-subscriber, per-stream forwarding decisions — a metadata operation, not a transcoding operation.

Production hybrid: 1:1 calls → Mesh (direct peer-to-peer). Small groups (under 10) → Single SFU. 10-49 → SFU with aggressive simulcast. 50-1,000 → SFU + server audio mixing. Above 1,000 → SFU for panelists + CDN/HLS for audience.
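The hybrid rule reads naturally as a lookup on participant count. In this sketch the boundary between "small group" and the simulcast tier (10) follows the ranges in the text, while the returned labels are made-up identifiers.

```python
def select_topology(participants: int) -> str:
    """Map a meeting size to a media topology, following the hybrid rule above.

    The string labels are illustrative names, not a real API.
    """
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    if participants == 2:
        return "mesh"                      # 1:1 call, direct peer-to-peer
    if participants < 10:
        return "single_sfu"                # small group
    if participants <= 49:
        return "sfu_simulcast"             # aggressive simulcast
    if participants <= 1000:
        return "sfu_audio_mixing"          # SFU plus server-side audio mixing
    return "sfu_panelists_plus_cdn"        # panelists on SFU, audience on CDN/HLS


if __name__ == "__main__":
    for size in (2, 6, 25, 400, 5000):
        print(size, select_topology(size))
```

Because meetings grow and shrink, a real system would re-evaluate this on join and leave events and migrate between modes rather than decide once at creation.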

06.2

WebRTC Signaling & Connection Establishment

Before a single frame of video flows, there's an elaborate dance: SDP offer/answer for codec negotiation, ICE for NAT traversal (gathering host/srflx/relay candidates, connectivity checks), and DTLS for encryption key exchange.

~85% of users connect via STUN (direct UDP), ~15% need TURN relay (restrictive NATs/firewalls). Total time from click to first media: ~700ms, optimized to ~500ms with pre-gathering, STUN caching, and DTLS session resumption.

Chain of trust: Signaling TLS → SDP fingerprint → DTLS → SRTP keys. The signaling server relays negotiation but never sees media.

06.3

Adaptive Bitrate & Congestion Control

Google Congestion Control (GCC) uses two parallel estimators: a delay-based controller (Kalman filter on inter-arrival time gradients) and a loss-based controller. The system is fast to downgrade (15% reduction) and slow to upgrade (8% probe) — protecting real-time experience over maximizing quality.
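The asymmetric "fast down, slow up" behavior can be sketched as a small rate-update function. This is a simplification under stated assumptions: the delay estimator is reduced to a three-state signal, one call is taken to represent roughly one update interval, and the clamp bounds are illustrative. It is not the libwebrtc implementation.

```python
from enum import Enum


class Signal(Enum):
    OVERUSE = "overuse"      # queuing delay is growing
    NORMAL = "normal"        # no congestion detected
    UNDERUSE = "underuse"    # queues are draining


def next_bitrate(current_bps: int, received_bps: int, signal: Signal,
                 min_bps: int = 30_000, max_bps: int = 2_500_000) -> int:
    """Return the next target send rate.

    OVERUSE:  cut to 85% of the rate the receiver actually measured.
    NORMAL:   probe upward by 8%.
    UNDERUSE: hold, letting queues drain before probing again.
    """
    if signal is Signal.OVERUSE:
        target = 0.85 * received_bps
    elif signal is Signal.NORMAL:
        target = 1.08 * current_bps
    else:
        target = float(current_bps)
    return round(max(min_bps, min(max_bps, target)))


if __name__ == "__main__":
    rate = 1_000_000
    for sig in (Signal.NORMAL, Signal.NORMAL, Signal.OVERUSE, Signal.UNDERUSE):
        rate = next_bitrate(rate, received_bps=rate, signal=sig)
        print(sig.value, rate)
```

Basing the decrease on the received rate rather than the sent rate matters: during congestion the sender's own target overstates what the path is actually delivering.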

TWCC (Transport-Wide Congestion Control) provides per-receiver bandwidth estimates at the SFU, enabling independent simulcast layer selection per subscriber. Alice on fiber gets 720p while Bob on mobile gets 360p — from the same sender.

Degradation Ladder

>2.5 Mbps: 720p30 + 360p30 + 180p30
1.5-2.5 Mbps: 720p15 + 360p30
0.8-1.5 Mbps: 360p30 + 180p15
0.3-0.8 Mbps: 360p15
<0.3 Mbps: Audio only
<80 Kbps: Opus narrowband

Audio ALWAYS wins.
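The ladder maps directly onto a threshold table. In this sketch the rung values come from the ladder above; treating each lower bound as inclusive and the `opus_default` label for the normal audio mode are assumptions.

```python
# (minimum estimate in Mbps, video layers to send), highest rung first.
LADDER = [
    (2.5, ["720p30", "360p30", "180p30"]),
    (1.5, ["720p15", "360p30"]),
    (0.8, ["360p30", "180p15"]),
    (0.3, ["360p15"]),
]

NARROWBAND_BELOW_MBPS = 0.08  # 80 Kbps


def select_layers(estimate_mbps: float) -> dict:
    """Pick the video layers and audio mode for a bandwidth estimate."""
    for floor_mbps, layers in LADDER:
        if estimate_mbps >= floor_mbps:
            return {"video": layers, "audio": "opus_default"}
    # Below the lowest video rung: video is dropped, audio keeps flowing.
    audio = "opus_narrowband" if estimate_mbps < NARROWBAND_BELOW_MBPS else "opus_default"
    return {"video": [], "audio": audio}


if __name__ == "__main__":
    for mbps in (4.0, 2.0, 1.0, 0.5, 0.2, 0.05):
        print(mbps, select_layers(mbps))
```

Every branch returns an audio mode, which is the code-level version of "audio always wins": there is no rung on which audio is turned off.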

06.4

Audio Pipeline — AEC, Mixing, Jitter Buffers

Users tolerate terrible video but abandon calls within 10 seconds of bad audio. The pipeline: Capture → Noise Suppression (neural network, RNNoise-style) → AEC (adaptive NLMS filter covering a 100-300ms echo tail) → AGC → VAD → Opus encode (20ms frames, in-band FEC) → Network → Jitter Buffer (adaptive, 30-60ms) → Decode → Mix → Playback.

AEC is the hardest DSP problem — modeling room acoustics in real-time, handling double-talk detection, non-linear speaker distortion, and variable system latency (Android: 50-200ms, highly variable). Server-side audio mixing for large meetings selects top-3 loudest speakers with hysteresis, creating personalized N-speaker mixes excluding each participant's own audio.
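Top-3 selection with hysteresis can be sketched by giving current speakers a small loudness bonus, so a challenger must be clearly louder before it displaces an incumbent. The 3 dB margin and the function names are assumptions; the text only states that hysteresis is used.

```python
def select_active_speakers(levels_db: dict[str, float], current: list[str],
                           max_speakers: int = 3,
                           hysteresis_db: float = 3.0) -> list[str]:
    """Pick the loudest speakers, favoring those already selected.

    levels_db maps participant id to a smoothed audio level in dB
    (higher is louder). Ties are broken by participant id for stability.
    """
    def score(pid: str) -> float:
        bonus = hysteresis_db if pid in current else 0.0
        return levels_db[pid] + bonus

    ranked = sorted(levels_db, key=lambda pid: (-score(pid), pid))
    return ranked[:max_speakers]


def mix_for(listener: str, speakers: list[str]) -> list[str]:
    """Personalized mix: the selected speakers minus the listener's own audio."""
    return [s for s in speakers if s != listener]


if __name__ == "__main__":
    levels = {"a": -20.0, "b": -25.0, "c": -30.0, "d": -29.0}
    active = select_active_speakers(levels, current=["a", "b", "c"])
    print(active, mix_for("a", active))
```

Without the bonus, two participants at nearly equal levels would swap in and out of the mix every frame, which listeners hear as choppy audio.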

06.5

End-to-End Encryption

Double encryption: SFrame encrypts the payload with a meeting key (SFU can't decrypt), SRTP encrypts the transport. Key exchange via MLS protocol (RFC 9420) using a ratchet tree — O(log N) rekeying on participant join/leave vs O(N²) for sender keys.

E2EE disables: server-side recording, live transcription, server audio mixing, PSTN dial-in, and compliance monitoring. This is why it's opt-in, not default — the industry consensus across Zoom, Teams, and Meet.

06.6

Global Media Server Placement & Cascading

80-150+ PoPs globally in three tiers: 15 mega-PoPs (cascade hubs), 40 regional, 50+ micro-PoPs at ISP peering points. Each unique stream crosses any inter-PoP link exactly once, regardless of subscriber count on each side.

Cascade topology is dynamic and per-meeting: direct link for 2 PoPs, star for 3-5, minimum latency spanning tree for 6+. Dedicated backbone between Tier 1 PoPs delivers consistent 70-85ms RTT with <0.1% loss, vs public internet's variable 80-140ms with 0.5-3% loss.
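The per-meeting topology rule can be sketched as follows. Two interpretations here are assumptions: the star hub is chosen as the PoP with the lowest total RTT to the others, and "minimum latency spanning tree" is read as a minimum spanning tree over pairwise RTT, built with Prim's algorithm.

```python
def cascade_links(pops: list[str], rtt_ms: dict[frozenset, float]) -> list[tuple[str, str]]:
    """Return the inter-PoP links for one meeting.

    rtt_ms maps frozenset({a, b}) to the measured RTT between PoPs a and b.
    """
    def rtt(a: str, b: str) -> float:
        return rtt_ms[frozenset((a, b))]

    n = len(pops)
    if n < 2:
        return []                                  # single PoP, no cascade
    if n == 2:
        return [(pops[0], pops[1])]                # direct link
    if n <= 5:
        # Star: hub is the PoP with the smallest total RTT to the rest.
        hub = min(pops, key=lambda h: sum(rtt(h, p) for p in pops if p != h))
        return [(hub, p) for p in pops if p != hub]

    # 6+ PoPs: Prim's algorithm for a minimum spanning tree over RTT.
    in_tree = {pops[0]}
    links: list[tuple[str, str]] = []
    while len(in_tree) < n:
        a, b = min(
            ((a, b) for a in in_tree for b in pops if b not in in_tree),
            key=lambda edge: rtt(*edge),
        )
        links.append((a, b))
        in_tree.add(b)
    return links


if __name__ == "__main__":
    names = [f"p{i}" for i in range(6)]
    # Synthetic RTTs: PoPs laid out on a line, 10 ms apart.
    rtts = {frozenset((a, b)): 10.0 * abs(i - j)
            for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    print(cascade_links(names, rtts))
```

Any tree over N PoPs has exactly N-1 links, which is what guarantees each stream crosses a given inter-PoP link only once.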

06.7

Recording & Compositing Pipeline

Separate capture from compositing. Recording agent writes raw encoded tracks to S3 during the meeting (just file I/O, ~0.1 CPU cores per meeting). Post-meeting, GPU workers composite into gallery/speaker view MP4 — decode all tracks, synchronize via RTP timestamps, layout computation, composite, re-encode with H.264+AAC.

GPU compositing: ~2 minutes per hour of meeting. CPU-only: ~30 minutes. GPU is non-negotiable at scale. Lazy generation: only gallery view by default, speaker view and individual tracks on demand — reduces GPU usage by ~60%.

06.8

Scalable Signaling — Meeting State at 300M Users

Meeting state lives in Redis (hot) + PostgreSQL (cold). Mutations use optimistic concurrency with epoch numbers. Cross-server broadcast via Redis Pub/Sub (same region) + NATS JetStream (cross-region). Meeting-affine routing places all small-meeting participants on the same signaling server, eliminating pub/sub overhead.

Meeting state is single-leader in the host's region — all mutations route there. Non-home participants accept ~80-120ms extra latency on state changes (acceptable for the 500ms signaling budget). Event sourcing provides full audit trail, state reconstruction, analytics, and compliance.
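Optimistic concurrency with epoch numbers boils down to a compare-and-set: a writer states the epoch it read, and the write is rejected if the state has moved on. The sketch below uses an in-memory dictionary as a stand-in for the hot store; the class and method names are hypothetical, and a Redis-backed version would need the check and the write to run atomically.

```python
from dataclasses import dataclass, field


class StaleEpochError(Exception):
    """Raised when a writer's view of the meeting state is out of date."""


@dataclass
class MeetingState:
    epoch: int = 0
    muted: dict = field(default_factory=dict)   # participant_id -> bool


class MeetingStore:
    """In-memory stand-in for the hot meeting-state store."""

    def __init__(self) -> None:
        self._meetings: dict[str, MeetingState] = {}

    def get(self, meeting_id: str) -> MeetingState:
        return self._meetings.setdefault(meeting_id, MeetingState())

    def set_muted(self, meeting_id: str, participant_id: str,
                  muted: bool, expected_epoch: int) -> int:
        """Apply a mute change only if the caller saw the latest epoch."""
        state = self.get(meeting_id)
        if state.epoch != expected_epoch:
            raise StaleEpochError(
                f"expected epoch {expected_epoch}, current is {state.epoch}")
        state.muted[participant_id] = muted
        state.epoch += 1
        return state.epoch


if __name__ == "__main__":
    store = MeetingStore()
    epoch = store.set_muted("m_8f3k29x", "p_1", True, expected_epoch=0)
    print("new epoch:", epoch)
```

A writer that hits `StaleEpochError` re-reads the state and retries, so conflicting host actions resolve without holding a lock across the network.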

06.9

Codec Selection — VP8/VP9/AV1/H.264

VP9 is the current default (30% better compression than H.264, native SVC). H.264 is the mandatory fallback (Safari/iOS launched WebRTC with H.264 only and still leans on it for hardware-accelerated encode; about 25% of users). AV1 is the future (60-80% better than H.264) but hardware encode isn't universal yet.

Mixed-codec meetings handled by dual-publish (VP9 sender also sends H.264) or codec unification. Audio is settled: Opus won — royalty-free, better than every competitor at every bitrate, seamless speech/music switching, built-in FEC.

06.10

Last-Mile Quality — FEC, NACK, Packet Loss Recovery

WiFi is the primary villain (contention, interference, bufferbloat). Three recovery strategies in priority cascade: FEC (FlexFEC, Reed-Solomon codes, adaptive rate, 2D interleaving for burst loss) → NACK retransmission (RTX stream, only when RTT < jitter buffer depth × 0.6) → Concealment (frame freeze, motion compensation, Opus PLC).
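The NACK gate in the cascade is a one-line comparison: a retransmission is only worth requesting if it can arrive before the jitter buffer plays past the gap. A minimal sketch, with the 0.6 factor taken from the text and the function names invented for illustration:

```python
def should_nack(rtt_ms: float, jitter_buffer_ms: float, factor: float = 0.6) -> bool:
    """True if a retransmitted packet can plausibly arrive in time."""
    return rtt_ms < jitter_buffer_ms * factor


def recovery_plan(rtt_ms: float, jitter_buffer_ms: float, fec_recovered: bool) -> str:
    """Walk the cascade for one lost packet: FEC, then NACK, then concealment."""
    if fec_recovered:
        return "fec"          # loss already repaired from redundancy
    if should_nack(rtt_ms, jitter_buffer_ms):
        return "nack"         # ask the sender to retransmit on the RTX stream
    return "conceal"          # too late to retransmit; hide the gap


if __name__ == "__main__":
    print(recovery_plan(rtt_ms=20, jitter_buffer_ms=60, fec_recovered=False))
    print(recovery_plan(rtt_ms=120, jitter_buffer_ms=60, fec_recovered=False))
```

This is why NACK works well for users near their PoP and is close to useless on long-haul paths, where FEC has to carry the load.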

Keyframes get 2-3x more FEC redundancy. Audio has triple protection: Opus in-band FEC + RFC 2198 redundancy + FlexFEC. Audio survives up to 20-30% packet loss with barely perceptible degradation. Always reduce video bitrate BEFORE adding FEC to avoid the FEC death spiral.

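07

Key Design Decisions & Tradeoffs

Decision	Chosen	Alternative
Media Routing	SFU (Selective Forwarding)	MCU (Multipoint Control Unit)
Transport Protocol	UDP with App-Level Reliability	TCP (reliable transport)
Encryption Model	Transport Encryption Default, E2EE Opt-in	E2EE by Default
Deployment Model	80+ Edge PoPs Globally	5-10 Cloud Regions
Video Codec	VP9 + H.264 Fallback	H.264 Only

H.264 Only: Universal hardware support but the worst compression of the codecs considered. At 50M concurrent participants, 30% worse compression means tens of terabits per second of extra bandwidth and hundreds of millions in annual cost difference.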

08

What Can Go Wrong

SFU Server Crash Mid-Meeting

1,200 participants lose media. Clients detect WebSocket drop (3-5s), reconnect to signaling, get assigned new SFU, renegotiate SDP+ICE. 3-8 second interruption. Meeting state preserved in signaling layer. Mitigated by health checks, graceful draining, and stateless SFU design.

Entire PoP Outage

All participants routing through that PoP lose connectivity. Failover to next-nearest PoP (35-80ms RTT increase). 10-30 second outage. Mitigated by multi-path connectivity, BGP anycast, client-side PoP failover lists, and capacity headroom in adjacent PoPs.

Client Network Degradation (WiFi Death Spiral)

Roommate starts Netflix → available bandwidth drops from 10 to 2 Mbps. GCC detects within 500ms → drop simulcast layers → enable FEC → audio-only if needed. Audio always gets priority. Full adaptation takes 1-8 seconds. System suggests network switch if consistently poor.

Thundering Herd — Mass Meeting Start

50K employees join all-hands at 9:00 AM. Mitigated by: join rate limiting with queue, participant list pagination, batched state broadcasts, meeting mode escalation (auto-transition to webinar), and SFU pre-provisioning for scheduled large meetings.

Recording Pipeline Failure

Recording agent crash: last 5-10s lost (multipart uploads saved previous chunks to S3). Compositor failure: job returns to queue, retried automatically — raw tracks retained 7 days. Redundant recording agents for compliance-critical meetings.

Cascading Retry Storm

100K clients reconnect simultaneously after signaling server recovery → crash again → loop. Mitigated by exponential backoff with jitter (spreads reconnections over 12s), server-side connection rate limiting (1000/s), circuit breaker, and load shedding (prioritize ongoing meetings over new joins).
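Client-side backoff with jitter can be sketched in a few lines. This version uses "full jitter" (a uniform draw between zero and an exponentially growing ceiling); the 12-second cap mirrors the spread mentioned above, while the base delay and the function name are assumptions.

```python
import random


def reconnect_delay(attempt: int, base_s: float = 0.5, cap_s: float = 12.0,
                    rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles each attempt up to cap_s; the actual delay is drawn
    uniformly below it so that clients do not retry in lockstep.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt, rng=rng), 2))
```

The jitter is the important part: plain exponential backoff keeps the herd synchronized, so every retry wave hits the recovering server at the same instant.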

Overarching principle: Media continues flowing even when everything else breaks. The SFU forwards packets based on local state — it doesn't query Redis or depend on signaling. Meeting metadata and mute buttons can break temporarily, and the conversation continues.

09

Interview Tips

Lead with the Three Planes
Immediately establish that this system has three fundamentally different planes (control, signaling, media) that scale independently. Most candidates draw one monolithic backend — separating them signals senior-level thinking.

Start with Physics, Not Components
Don't jump to "I'll use Kafka and Redis." Start with: speed of light in fiber ≈ 200K km/s, mouth-to-ear budget = 200ms, Dubai→London = 80ms RTT. This FORCES edge deployment; let the numbers drive the architecture.
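The physics argument is easy to make concrete. The sketch below computes the propagation floor for a path; the 200K km/s fiber speed and the 200ms budget come from the text, while the ~5,500 km Dubai to London distance is an assumed great-circle figure. Real routes are longer and add equipment delay, which is consistent with the ~80ms RTT quoted above.

```python
FIBER_KM_PER_S = 200_000.0   # speed of light in fiber, from the text


def min_rtt_ms(path_km: float) -> float:
    """Lower bound on round-trip time from propagation delay alone."""
    return 2.0 * path_km / FIBER_KM_PER_S * 1000.0


def remaining_budget_ms(path_km: float, budget_ms: float = 200.0) -> float:
    """Mouth-to-ear budget left after one-way propagation is paid for."""
    one_way_ms = min_rtt_ms(path_km) / 2.0
    return budget_ms - one_way_ms


if __name__ == "__main__":
    dubai_london_km = 5_500.0  # assumed great-circle distance
    print(f"RTT floor: {min_rtt_ms(dubai_london_km):.1f} ms")
    print(f"budget left: {remaining_budget_ms(dubai_london_km):.1f} ms")
```

Whatever remains has to cover capture, encode, jitter buffering, decode, and playback, which is why every avoidable network hop matters.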

Nail the SFU Explanation
"SFU forwards encrypted packets without decoding — a smart packet router. Combined with simulcast (3 quality layers), each receiver gets bandwidth-adapted quality. 50x cheaper than MCU, adds 1-5ms latency. The SFU mesh is a real-time CDN."

The Degradation Ladder Is Your Secret Weapon
Most candidates describe the happy path. Walk through: resolution → frame rate → video off → audio quality → narrowband audio. "Audio always wins" shows you understand real-time system behavior under stress.

Use Domain Vocabulary
Say "SFU" not "video server." Say "simulcast layers" not "different quality videos." Say "mouth-to-ear latency" not "delay." Say "cascading" not "server forwarding." Say "TWCC" not "congestion detection." Using terms naturally signals domain experience.

Time Budget: 35 Minutes
0-3 min: clarify scope. 3-8: requirements + scale. 8-15: high-level architecture (three planes). 15-20: API design. 20-30: deep dive (SFU internals or congestion control). 30-35: tradeoffs + failures. Don't spend 15 minutes on the control plane — it's a standard web app.

Things to AVOID
Don't say "We'll just use WebRTC" (it's a browser API, not architecture). Don't use Kafka for video (latency!). Don't suggest TCP for media (head-of-line blocking). Don't suggest CDN caching for the interactive media path (nothing to cache; CDN/HLS only fits the broadcast audience tier). Don't discuss database schemas (nobody cares for this problem).

10

Evolution

How this design grows from a weekend prototype to a planet-scale conferencing platform.

1

MVP — "It Works on My Laptop" (0-1K users)

Single server running SFU + signaling + API + PostgreSQL. Mesh for 1:1, single SFU for groups up to 6. H.264 only, no simulcast, no recording. Open-source mediasoup/Janus. One cloud VM at $50/month.

2

Multi-Server — "Paying Customers" (1K-50K users)

Separate concerns: API, signaling, SFU, TURN as independent services. Deploy in 2-3 cloud regions. Add simulcast (3 layers), VP9, Redis for state, basic recording with CPU-based FFmpeg compositing. Screen sharing, waiting rooms, host controls. $2K-10K/month.

3

Production-Grade — "Enterprise Calling" (50K-5M users)

Expand to 15-20 PoPs with SFU cascading. VP9 SVC, adaptive FEC (FlexFEC), TWCC bandwidth estimation, server-side audio mixing. GPU recording, E2EE (MLS), NATS JetStream, event sourcing. SSO/SAML, compliance recording, live transcription. $50K-200K/month.

4

Scale — "Competing with Zoom" (5M-100M users)

80-100+ PoPs in 3 tiers. Dedicated backbone between Tier 1 hubs. Dynamic per-meeting cascade topology. Meeting mode escalation (interactive → webinar → broadcast). AV1 adoption, content-aware encoding, multi-path transport. AI features: noise suppression, smart chapters, summaries. $1-5M/month.

5

Planet-Scale — "We ARE Zoom" (300M+ users)

150+ PoPs + sovereign deployments (China, EU, Russia). Own backbone (submarine cable partnerships). Custom silicon (FPGA SFU, video encoding ASICs, SmartNICs). AI-native: real-time translation, generative codec research (face landmarks at 5 Kbps), meeting Q&A. Platform SDK ecosystem. $30M+/month.

