System Design — 16

Video Conferencing

Design a real-time video conferencing platform like Zoom or Microsoft Teams — delivering sub-300ms audio/video to millions of concurrent participants across the globe while adapting to wildly varying network conditions.

Real-Time Media · WebRTC · SFU Architecture · Edge Computing · UDP / RTP · Adaptive Bitrate
01

Problem Statement

Design a real-time video conferencing system that supports one-on-one calls, group meetings of up to 49 active video participants, and large webinars with up to 1,000 audio participants. The system must deliver sub-300ms mouth-to-ear latency, gracefully adapt to varying network conditions, and scale to 300M+ daily participants globally.

Unlike typical request-response web systems, video conferencing is a continuous bidirectional real-time media pipeline. You're not serving web pages; you're routing live media continuously, with a new audio frame every 20ms. The constraints are physics-level: speed of light, codec latency, and jitter buffers.

Core question: How do you deliver real-time audio/video to hundreds of participants across the globe with sub-300ms latency, while gracefully adapting to wildly varying network conditions?

02

Requirements

Functional Requirements

  • 1:1 and group video calls — up to 49 active video, 1,000 audio participants
  • Screen sharing — additional media stream from any participant
  • Meeting management — create, schedule, join via link, waiting rooms, host controls (mute, kick, admit)
  • In-meeting chat — text alongside the call
  • Recording — server-side recording with post-meeting compositing
  • Reactions & hand-raise — lightweight signaling alongside media

Non-Functional Requirements

  • Ultra-low latency — end-to-end <300ms mouth-to-ear (200ms ideal); beyond 400ms, conversation breaks down
  • High availability — 99.99% uptime; conferencing outages are immediately visible
  • Adaptive quality — graceful degradation on poor networks rather than dropping the call
  • Global scale — 300M+ daily participants across every continent
  • Security — transport encryption default, E2EE optional; meeting access controls
03

Scale Estimation

Grounded in Zoom-scale numbers. 300M daily participants, average 40-minute meetings with 6 participants → ~50M meetings/day.
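The arithmetic behind these estimates can be reproduced in a few lines. This is only a back-of-envelope sketch: `PEAK_TO_AVERAGE` and `UPLINK_BPS` are assumptions chosen for illustration, not figures stated in this section, and the peak concurrency is taken from the estimate listed here.

```python
# Back-of-envelope scale estimate. Values marked "assumption" are illustrative.
DAILY_PARTICIPANTS = 300_000_000
PARTICIPANTS_PER_MEETING = 6
SECONDS_PER_DAY = 86_400
PEAK_TO_AVERAGE = 3            # assumption: peak start rate is ~3x the daily average
PEAK_CONCURRENT = 50_000_000   # peak concurrent participants, as listed in this section
UPLINK_BPS = 1.6e6             # assumption: ~1.6 Mbps sent per participant

meetings_per_day = DAILY_PARTICIPANTS // PARTICIPANTS_PER_MEETING
avg_starts_per_sec = meetings_per_day / SECONDS_PER_DAY
peak_starts_per_sec = avg_starts_per_sec * PEAK_TO_AVERAGE
peak_ingest_bps = PEAK_CONCURRENT * UPLINK_BPS

print(f"meetings/day:  {meetings_per_day:,}")
print(f"avg starts/s:  {avg_starts_per_sec:,.0f}")
print(f"peak starts/s: {peak_starts_per_sec:,.0f}")
print(f"peak ingest:   {peak_ingest_bps / 1e12:,.0f} Tbps")
```

Changing the assumed per-participant uplink rate scales the ingest figure linearly, which is why simulcast layer budgets matter so much at this scale.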

  • Daily participants: 300M
  • Peak concurrent participants: ~50M
  • Peak meetings started: ~1,700/s
  • Peak bandwidth ingest: ~80 Tbps (50M concurrent participants at roughly 1.6 Mbps each)
  • New recordings: ~500 TB/day
  • Signaling events: ~5M/s

What the Numbers Tell Us

The bandwidth numbers are the entire story. You cannot route 80+ Tbps through any single data center. This forces a massively distributed edge architecture: media servers in every major metro, with streams never touching a central location unless cascading is needed. The system is closer to a CDN than a traditional web app.

04

API Design

Two distinct API surfaces: a REST API for meeting lifecycle and a WebSocket signaling protocol for in-call control.

A) Meeting Lifecycle — REST

Create Meeting
POST /api/v1/meetings
Authorization: Bearer <token>

{
  "title": "Sprint Planning",
  "type": "scheduled",
  "start_time": "2026-04-08T10:00:00Z",
  "duration_minutes": 60,
  "settings": {
    "max_participants": 100,
    "waiting_room": true,
    "mute_on_entry": true,
    "allow_recording": true,
    "e2ee_enabled": false
  }
}

→ 201: { "meeting_id": "m_8f3k29x", "join_url": "https://meet.example.com/j/8f3k29x" }

Join Meeting — Get Connection Details
POST /api/v1/meetings/{meeting_id}/join
→ 200: {
  "participant_id": "p_29fk3m",
  "session_token": "st_...",
  "media_server": {
    "url": "wss://edge-dubai-01.media.example.com",
    "region": "me-south-1",
    "ice_servers": [
      { "urls": "stun:stun.example.com:3478" },
      { "urls": "turn:turn-dubai.example.com:443", "username": "...", "credential": "..." }
    ]
  }
}

Key: The join response returns the nearest media server and ICE/TURN credentials. The client doesn't pick a server — the backend does geo-routing.
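As a rough illustration of backend-side geo-routing, the sketch below picks the geographically nearest PoP that still has spare capacity. The `Pop` type, the `pick_pop` function, the coordinates, the load values, and the 0.85 load ceiling are all hypothetical; a production system would more likely rank PoPs by measured RTT than by great-circle distance.

```python
import math
from dataclasses import dataclass

@dataclass
class Pop:
    name: str
    lat: float
    lon: float
    load: float  # fraction of capacity in use, 0.0 to 1.0 (hypothetical metric)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def pick_pop(client_lat, client_lon, pops, max_load=0.85):
    """Return the nearest PoP whose load is below max_load."""
    candidates = [p for p in pops if p.load < max_load]
    if not candidates:
        raise RuntimeError("no PoP with spare capacity")
    return min(candidates, key=lambda p: haversine_km(client_lat, client_lon, p.lat, p.lon))

# Illustrative fleet; names other than edge-dubai-01 are made up.
POPS = [
    Pop("edge-dubai-01", 25.20, 55.27, 0.40),
    Pop("edge-london-01", 51.51, -0.13, 0.55),
    Pop("edge-mumbai-01", 19.08, 72.88, 0.90),  # over the load ceiling
]

print(pick_pop(24.45, 54.38, POPS).name)  # client near Abu Dhabi
print(pick_pop(19.00, 72.80, POPS).name)  # client in Mumbai, local PoP is full
```

The second call shows why capacity has to be part of the decision: the closest PoP is skipped when it is saturated, and the client lands on the next-nearest one.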

B) Real-Time Signaling — WebSocket

Once joined, all in-meeting communication flows over a persistent WebSocket — bidirectional JSON messages for publishing tracks (with simulcast layers), subscribing to other participants, muting, reactions, chat, SDP exchange, and ICE candidate trickle.

C) SDP Exchange

The WebRTC handshake uses SDP offer/answer for codec negotiation, encryption fingerprint exchange, and simulcast layer declaration. ICE candidates are trickled asynchronously as they're discovered.

05

High-Level Architecture

Video conferencing has three fundamentally different planes that scale independently — the key insight that separates this from typical web architecture.

Control Plane

REST APIs, meeting CRUD, auth, scheduling. Standard web backend — stateless services, PostgreSQL, Redis. The boring (but necessary) stuff.

Signaling Plane

Real-time meeting state over WebSocket. Who's in the call, mute states, SDP negotiation, chat. Stateful, low-bandwidth, needs reliability and ordering.

Media Plane

Actual audio/video packets over UDP/RTP. This is a real-time CDN — 99% of infrastructure cost and complexity. SFU servers at 80+ edge PoPs globally.

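[Diagram: Client A (Dubai) and Client B (London) reach the platform through GeoDNS routing. Components shown: Control Plane (REST API), SFU at the Dubai edge PoP, SFU at the London edge PoP, Signaling (WebSocket + NATS), Recording (GPU workers), and Storage (PG + Redis + S3). The two SFUs are joined by a cascade link, and each client exchanges media with its SFU over UDP/RTP.]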

Capacity at Peak

Component | Scale
Media Edge Servers (SFU) | 50,000–100,000 across 80+ PoPs
Signaling Servers | 2,000–5,000
TURN Relays | 5,000–10,000
Recording Workers (GPU) | 2,000–5,000 burst
Control Plane | 500–1,000 (standard web tier)

Request Flow

Client → GeoDNS → Control Plane → Signaling → ICE/STUN → SFU → Media Flows
06

Deep Dives — 10 Rabbit Holes

Video conferencing is uniquely rich in deep technical domains. Each deep dive covers the single most interesting aspect of its subsystem.

06.1
SFU vs MCU vs Mesh

The single most important architectural decision in video conferencing. Every other choice cascades from it.

Factor | Mesh | MCU | SFU
Server CPU | None | Extreme | Minimal
Client upload | N-1 streams | 1 stream | 1 stream (3 layers)
Added latency | ~0ms | 100-200ms | 1-5ms
Per-receiver quality | No | No | Yes (simulcast)
E2EE possible | Yes | No | Yes
Max video participants | 3-4 | 50-100 | 49 video / 1000 audio
Cost at scale | $0 | $$$$$ | $$$

An SFU never decodes or encodes video; it forwards encrypted RTP packets at the packet level. Combined with simulcast (the sender encodes 3 quality layers: 720p/360p/180p), each receiver gets the layer that best fits its available bandwidth. The SFU makes per-subscriber, per-stream forwarding decisions, which is a metadata operation, not a transcoding operation.
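The per-subscriber forwarding decision can be sketched as a simple lookup. The layer bitrates, the 0.8 headroom factor, and the subscriber bandwidth estimates below are assumed values for illustration only; a real SFU would also account for layer switching on keyframes and for hysteresis.

```python
# Simulcast layers, best first, as (name, approximate bitrate in kbps).
# The bitrates are assumptions, not values from the text.
LAYERS = [("720p", 1500), ("360p", 500), ("180p", 150)]

def select_layer(estimated_kbps, headroom=0.8):
    """Pick the best layer that fits within headroom * estimate.

    Returns None when no video layer fits (audio-only for this subscriber).
    """
    budget = estimated_kbps * headroom
    for name, kbps in LAYERS:
        if kbps <= budget:
            return name
    return None

# Hypothetical per-subscriber bandwidth estimates in kbps.
subscribers = {"alice": 4000, "bob": 700, "carol": 120}
forwarding = {who: select_layer(bw) for who, bw in subscribers.items()}
print(forwarding)
```

Note that nothing here touches the media payload: the SFU only chooses which of the sender's already-encoded layers to forward to each subscriber.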

Production hybrid:

  • 1:1 calls → Mesh
  • Small groups → Single SFU
  • 10-49 participants → SFU with aggressive simulcast
  • 50-1000 participants → SFU + server audio mixing
  • 1000+ participants → SFU for panelists + CDN/HLS for audience

06.2
WebRTC Signaling & Connection Establishment

Before a single frame of video flows, there's an elaborate dance: SDP offer/answer for codec negotiation, ICE for NAT traversal (gathering host/srflx/relay candidates, connectivity checks), and DTLS for encryption key exchange.

~85% of users connect via STUN (direct UDP), ~15% need TURN relay (restrictive NATs/firewalls). Total time from click to first media: ~700ms, optimized to ~500ms with pre-gathering, STUN caching, and DTLS session resumption.

Chain of trust: Signaling TLS → SDP fingerprint → DTLS → SRTP keys. The signaling server relays negotiation but never sees media.
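To make the candidate types concrete, the sketch below computes ICE candidate priorities using the formula and recommended type preferences from RFC 8445. The function name and the default local preference of 65535 are choices made for this example.

```python
# Recommended type preferences from RFC 8445: direct paths rank above relayed ones.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)

for cand_type in ("host", "srflx", "relay"):
    print(f"{cand_type:5s} {candidate_priority(cand_type)}")
```

Because relay candidates carry the lowest type preference, connectivity checks naturally try direct and STUN-discovered paths first and fall back to TURN only when those fail.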

06.3
Adaptive Bitrate & Congestion Control

Google Congestion Control (GCC) uses two parallel estimators: a delay-based controller (Kalman filter on inter-arrival time gradients) and a loss-based controller. The system is fast to downgrade (15% reduction) and slow to upgrade (8% probe) — protecting real-time experience over maximizing quality.
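The fast-down, slow-up asymmetry can be illustrated with a heavily simplified rate controller. This is not the GCC algorithm itself: the delay and loss estimators are replaced by a hypothetical `signal` input, and the bitrate bounds are assumed values.

```python
def next_target_bps(current_bps, signal, min_bps=50_000, max_bps=2_500_000):
    """Simplified rate update: cut 15% on overuse, probe up 8% when normal.

    signal is one of "overuse", "normal", "underuse" and stands in for the
    output of the delay-based and loss-based estimators (not modelled here).
    """
    if signal == "overuse":
        target = current_bps * 0.85
    elif signal == "normal":
        target = current_bps * 1.08
    else:  # "underuse": hold while queues drain
        target = current_bps
    return max(min_bps, min(max_bps, target))

rate = 2_000_000
for signal in ["overuse", "overuse", "normal", "normal", "normal", "normal", "normal"]:
    rate = next_target_bps(rate, signal)
    print(f"{signal:8s} -> {rate / 1000:.0f} kbps")
```

In this run, two consecutive 15% cuts take five 8% probes to win back, which is the intended bias toward protecting the call over maximizing quality.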

TWCC (Transport-Wide Congestion Control) provides per-receiver bandwidth estimates at the SFU, enabling independent simulcast layer selection per subscriber. Alice on fiber gets 720p while Bob on mobile gets 360p — from the same sender.

Degradation Ladder

  • >2.5 Mbps: 720p30 + 360p30 + 180p30
  • 1.5-2.5 Mbps: 720p15 + 360p30
  • 0.8-1.5 Mbps: 360p30 + 180p15
  • 0.3-0.8 Mbps: 360p15
  • <0.3 Mbps: audio only
  • <80 kbps: Opus narrowband

Audio ALWAYS wins.
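The ladder maps naturally onto a threshold table. The sketch below is one possible encoding; the function name is hypothetical, and treating each boundary as inclusive on the upper rung is an assumption, since the ladder does not say which rung owns the exact threshold.

```python
# (minimum available kbps, what the sender publishes), highest rung first.
LADDER = [
    (2500, "720p30 + 360p30 + 180p30"),
    (1500, "720p15 + 360p30"),
    (800, "360p30 + 180p15"),
    (300, "360p15"),
    (80, "audio only"),
    (0, "audio only, Opus narrowband"),
]

def rung_for(available_kbps):
    """Return the publishing profile for the estimated available bandwidth."""
    for floor_kbps, profile in LADDER:
        if available_kbps >= floor_kbps:
            return profile
    raise ValueError("bandwidth estimate must be non-negative")

for kbps in (3000, 1000, 200, 50):
    print(kbps, "kbps ->", rung_for(kbps))
```

Every rung includes audio; video is the only thing that is ever shed.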

06.4
Audio Pipeline — AEC, Mixing, Jitter Buffers

Users tolerate terrible video but abandon calls within 10 seconds of bad audio. The pipeline: Capture → Noise Suppression (neural network, RNNoise-style) → AEC (adaptive NLMS filter covering a 100-300ms echo tail) → AGC → VAD → Opus encode (20ms frames, in-band FEC) → Network → Jitter Buffer (adaptive, 30-60ms) → Decode → Mix → Playback.

AEC is the hardest DSP problem — modeling room acoustics in real-time, handling double-talk detection, non-linear speaker distortion, and variable system latency (Android: 50-200ms, highly variable). Server-side audio mixing for large meetings selects top-3 loudest speakers with hysteresis, creating personalized N-speaker mixes excluding each participant's own audio.
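Speaker selection with hysteresis might look like the sketch below. The 6 dB margin, the function names, and the audio levels are assumptions for illustration; actual mixers typically also smooth levels over time and apply hold timers.

```python
def select_speakers(levels, current, k=3, margin_db=6.0):
    """Choose up to k active speakers.

    levels:  participant -> smoothed audio level in dBFS (higher is louder)
    current: set of participants selected in the previous interval
    A challenger replaces the quietest selected speaker only if it is louder
    by margin_db, which prevents rapid flip-flopping between similar levels.
    """
    selected = {p for p in current if p in levels}
    ranked = sorted(levels, key=levels.get, reverse=True)
    for p in ranked:  # fill any free slots with the loudest participants
        if len(selected) >= k:
            break
        selected.add(p)
    for p in ranked:  # let clearly louder challengers displace the quietest
        if p in selected:
            continue
        quietest = min(selected, key=levels.get)
        if levels[p] > levels[quietest] + margin_db:
            selected.remove(quietest)
            selected.add(p)
    return selected

def mix_for(listener, selected):
    """Personalized mix: everyone selected except the listener's own audio."""
    return sorted(selected - {listener})

levels = {"ann": -20.0, "bo": -25.0, "cy": -30.0, "di": -28.0, "ed": -60.0}
active = select_speakers(levels, current={"ann", "bo", "cy"})
print(sorted(active))          # di is only 2 dB louder than cy, so no change
levels["di"] = -22.0
active = select_speakers(levels, current=active)
print(sorted(active))          # di now beats cy by 8 dB and takes the slot
print(mix_for("ann", active))  # ann does not hear herself
```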

06.5
End-to-End Encryption

Double encryption: SFrame encrypts the payload with a meeting key (SFU can't decrypt), SRTP encrypts the transport. Key exchange via MLS protocol (RFC 9420) using a ratchet tree — O(log N) rekeying on participant join/leave vs O(N²) for sender keys.

E2EE disables: server-side recording, live transcription, server audio mixing, PSTN dial-in, and compliance monitoring. This is why it's opt-in, not default — the industry consensus across Zoom, Teams, and Meet.

06.6
Global Media Server Placement & Cascading

80-150+ PoPs globally in three tiers: 15 mega-PoPs (cascade hubs), 40 regional, 50+ micro-PoPs at ISP peering points. Each unique stream crosses any inter-PoP link exactly once, regardless of subscriber count on each side.

Cascade topology is dynamic and per-meeting: direct link for 2 PoPs, star for 3-5, minimum latency spanning tree for 6+. Dedicated backbone between Tier 1 PoPs delivers consistent 70-85ms RTT with <0.1% loss, vs public internet's variable 80-140ms with 0.5-3% loss.
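A minimal sketch of the per-meeting topology choice follows. The RTT values, the function name, and the hub rule (lowest total RTT to the other PoPs) are assumptions. Prim's algorithm is used for the spanning tree; it minimizes total link latency, which is one plausible reading of "minimum latency spanning tree".

```python
def cascade_links(pops, rtt):
    """Return the inter-PoP links for one meeting.

    pops: list of PoP names that have participants in the meeting
    rtt:  dict mapping frozenset({a, b}) -> round-trip time in ms
    """
    def cost(a, b):
        return rtt[frozenset((a, b))]

    n = len(pops)
    if n < 2:
        return []
    if n == 2:  # direct link
        return [(pops[0], pops[1])]
    if n <= 5:  # star around the PoP with the lowest total RTT to the others
        hub = min(pops, key=lambda h: sum(cost(h, p) for p in pops if p != h))
        return [(hub, p) for p in pops if p != hub]
    # 6 or more: minimum spanning tree (Prim's algorithm)
    in_tree = {pops[0]}
    links = []
    while len(in_tree) < n:
        a, b = min(
            ((a, b) for a in in_tree for b in pops if b not in in_tree),
            key=lambda edge: cost(*edge),
        )
        links.append((a, b))
        in_tree.add(b)
    return links

pops = ["dubai", "london", "frankfurt"]
rtt = {
    frozenset(("dubai", "london")): 80,      # figure used elsewhere in this article
    frozenset(("dubai", "frankfurt")): 75,   # assumed
    frozenset(("london", "frankfurt")): 12,  # assumed
}
print(cascade_links(pops, rtt))
```

With these numbers Frankfurt becomes the hub, so each stream crosses each inter-PoP link once, as the section describes.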

06.7
Recording & Compositing Pipeline

Separate capture from compositing. Recording agent writes raw encoded tracks to S3 during the meeting (just file I/O, ~0.1 CPU cores per meeting). Post-meeting, GPU workers composite into gallery/speaker view MP4 — decode all tracks, synchronize via RTP timestamps, layout computation, composite, re-encode with H.264+AAC.

GPU compositing: ~2 minutes per hour of meeting. CPU-only: ~30 minutes. GPU is non-negotiable at scale. Lazy generation: only gallery view by default, speaker view and individual tracks on demand — reduces GPU usage by ~60%.

06.8
Scalable Signaling — Meeting State at 300M Users

Meeting state lives in Redis (hot) + PostgreSQL (cold). Mutations use optimistic concurrency with epoch numbers. Cross-server broadcast via Redis Pub/Sub (same region) + NATS JetStream (cross-region). Meeting-affine routing places all small-meeting participants on the same signaling server, eliminating pub/sub overhead.

Meeting state is single-leader in the host's region — all mutations route there. Non-home participants accept ~80-120ms extra latency on state changes (acceptable for the 500ms signaling budget). Event sourcing provides full audit trail, state reconstruction, analytics, and compliance.
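Optimistic concurrency with epoch numbers can be sketched with an in-memory store standing in for Redis. The class and function names are hypothetical; in Redis the compare-and-set step would need to be atomic, for example via WATCH/MULTI/EXEC or a Lua script.

```python
class StaleEpoch(Exception):
    """Raised when a write is based on an out-of-date epoch."""

class MeetingStore:
    """In-memory stand-in for the hot meeting-state store."""

    def __init__(self):
        self._rows = {}  # meeting_id -> (epoch, state dict)

    def read(self, meeting_id):
        return self._rows.get(meeting_id, (0, {}))

    def compare_and_set(self, meeting_id, expected_epoch, new_state):
        epoch, _ = self.read(meeting_id)
        if epoch != expected_epoch:
            raise StaleEpoch(f"expected epoch {expected_epoch}, found {epoch}")
        self._rows[meeting_id] = (epoch + 1, new_state)
        return epoch + 1

def mutate(store, meeting_id, change, max_retries=5):
    """Read, apply the change to a copy, and write back; retry on conflict."""
    for _ in range(max_retries):
        epoch, state = store.read(meeting_id)
        try:
            return store.compare_and_set(meeting_id, epoch, change(dict(state)))
        except StaleEpoch:
            continue
    raise RuntimeError("gave up after repeated epoch conflicts")

def mute(participant_id):
    """Build a change function that adds a participant to the muted list."""
    def change(state):
        muted = set(state.get("muted", ()))
        muted.add(participant_id)
        return {**state, "muted": sorted(muted)}
    return change

store = MeetingStore()
print(mutate(store, "m_8f3k29x", mute("p_29fk3m")))
print(store.read("m_8f3k29x"))
```

Writers never hold a lock: a concurrent host action simply causes one side to re-read and retry against the newer epoch.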

06.9
Codec Selection — VP8/VP9/AV1/H.264

VP9 is the current default (30% better compression than H.264, native SVC). H.264 is the mandatory fallback (Safari/iOS, about 25% of users, has historically offered dependable WebRTC support only for H.264). AV1 is the future (60-80% better than H.264) but hardware encode isn't universal yet.

Mixed-codec meetings handled by dual-publish (VP9 sender also sends H.264) or codec unification. Audio is settled: Opus won — royalty-free, better than every competitor at every bitrate, seamless speech/music switching, built-in FEC.

06.10
Last-Mile Quality — FEC, NACK, Packet Loss Recovery

WiFi is the primary villain (contention, interference, bufferbloat). Three recovery strategies in priority cascade: FEC (FlexFEC, Reed-Solomon codes, adaptive rate, 2D interleaving for burst loss) → NACK retransmission (RTX stream, only when RTT < jitter buffer depth × 0.6) → Concealment (frame freeze, motion compensation, Opus PLC).

Keyframes get 2-3x more FEC redundancy. Audio has triple protection: Opus in-band FEC + RFC 2198 redundancy + FlexFEC. Audio survives up to 20-30% packet loss with barely perceptible degradation. Always reduce video bitrate BEFORE adding FEC to avoid the FEC death spiral.
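The idea behind FEC and the NACK rule above can be shown with a toy example. The sketch swaps in plain XOR parity over equal-length packets, the simplest form of FEC, rather than the Reed-Solomon codes mentioned above; the function names are hypothetical, and real schemes also protect packet lengths and headers.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Rebuild the single missing packet (marked as None) from the parity."""
    missing = bytearray(parity)
    for packet in received:
        if packet is not None:
            for i, byte in enumerate(packet):
                missing[i] ^= byte
    return bytes(missing)

def should_nack(rtt_ms, jitter_buffer_ms):
    """Retransmission only helps if the packet can return before playout."""
    return rtt_ms < jitter_buffer_ms * 0.6

group = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(group)
print(recover([b"AAAA", None, b"CCCC"], parity))  # second packet was lost
print(should_nack(rtt_ms=20, jitter_buffer_ms=50))
print(should_nack(rtt_ms=40, jitter_buffer_ms=50))
```

One parity packet per group of three costs about 33% extra bandwidth and repairs only a single loss per group, which is why redundancy has to be weighed against the video bitrate.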

07

Key Design Decisions & Tradeoffs

Media Routing

✓ Chosen

SFU (Selective Forwarding)

Forward encrypted packets without decode. 50x cheaper than MCU, adds only 1-5ms latency, preserves E2EE capability. Combined with simulcast for per-receiver quality adaptation.

✗ Alternative

MCU (Multipoint Control Unit)

Decode + composite + re-encode per meeting. ~20 CPU cores for a 10-person meeting. At 10M concurrent meetings → 200M cores. Also adds 100-200ms transcoding latency and breaks E2EE.

Transport Protocol

✓ Chosen

UDP with App-Level Reliability

No head-of-line blocking. Build custom reliability: FEC for proactive recovery, NACK for retransmission, jitter buffers for reordering, concealment for unrecoverable losses.

✗ Alternative

TCP (reliable transport)

Head-of-line blocking is fatal: one lost packet at 100ms RTT causes 100ms stall for ALL subsequent packets. Manifests as periodic freezes + fast-forward — far worse than a brief quality dip.

Encryption Model

✓ Chosen

Transport Encryption Default, E2EE Opt-in

Preserves recording, transcription, server mixing, PSTN dial-in, compliance monitoring. E2EE available for sensitive meetings where participants accept feature tradeoffs.

✗ Alternative

E2EE by Default

Disables all server-side media features. Client must decode and mix all audio locally (CPU-intensive). No PSTN dial-in. No compliance recording. Active speaker detection relies on spoofable client-reported levels.

Deployment Model

✓ Chosen

80+ Edge PoPs Globally

First/last hop <10ms. Physics-driven: mouth-to-ear budget is 200ms, can't waste 50ms+ on the first hop. Each metro gets local media servers. SFU mesh is a real-time CDN.

✗ Alternative

5-10 Cloud Regions

Simpler operations but forces 200ms+ one-way latency for inter-continental meetings. Budget blown before any processing occurs. Unacceptable for conversational media.

Video Codec

✓ Chosen

VP9 + H.264 Fallback

VP9 delivers 30% better compression + native SVC. H.264 mandatory fallback for Safari/iOS (~25% of users). AV1 adopted on capable hardware. Mixed-codec meetings via dual-publish or unification.

✗ Alternative

H.264 Only

Universal hardware support but the worst compression of the candidates. At 50M concurrent participants, 30% worse compression means proportionally more media bandwidth on every link, which adds up to hundreds of millions in annual cost difference.

08

What Can Go Wrong

SFU Server Crash Mid-Meeting

1,200 participants lose media. Clients detect WebSocket drop (3-5s), reconnect to signaling, get assigned new SFU, renegotiate SDP+ICE. 3-8 second interruption. Meeting state preserved in signaling layer. Mitigated by health checks, graceful draining, and stateless SFU design.

Entire PoP Outage

All participants routing through that PoP lose connectivity. Failover to next-nearest PoP (35-80ms RTT increase). 10-30 second outage. Mitigated by multi-path connectivity, BGP anycast, client-side PoP failover lists, and capacity headroom in adjacent PoPs.

Client Network Degradation (WiFi Death Spiral)

Roommate starts Netflix → available bandwidth drops from 10 to 2 Mbps. GCC detects within 500ms → drop simulcast layers → enable FEC → audio-only if needed. Audio always gets priority. Full adaptation takes 1-8 seconds. System suggests network switch if consistently poor.

Thundering Herd — Mass Meeting Start

50K employees join all-hands at 9:00 AM. Mitigated by: join rate limiting with queue, participant list pagination, batched state broadcasts, meeting mode escalation (auto-transition to webinar), and SFU pre-provisioning for scheduled large meetings.

Recording Pipeline Failure

Recording agent crash: last 5-10s lost (multipart uploads saved previous chunks to S3). Compositor failure: job returns to queue, retried automatically — raw tracks retained 7 days. Redundant recording agents for compliance-critical meetings.

Cascading Retry Storm

100K clients reconnect simultaneously after signaling server recovery → crash again → loop. Mitigated by exponential backoff with jitter (spreads reconnections over 12s), server-side connection rate limiting (1000/s), circuit breaker, and load shedding (prioritize ongoing meetings over new joins).
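Client-side reconnect timing could be sketched with "full jitter" backoff, as below. The 0.5s base is an assumption, the 12s cap mirrors the spread mentioned above, and the function name is hypothetical.

```python
import random

def reconnect_delay(attempt, base=0.5, cap=12.0, rng=random):
    """Full-jitter backoff: uniform delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

rng = random.Random(7)  # seeded only to make the demo repeatable
for attempt in range(6):
    print(attempt, round(reconnect_delay(attempt, rng=rng), 2))
```

Randomizing over the whole interval, rather than adding a small jitter to a fixed delay, is what breaks up the synchronized waves of reconnects that cause the loop.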

Overarching principle: Media continues flowing even when everything else breaks. The SFU forwards packets based on local state — it doesn't query Redis or depend on signaling. Meeting metadata and mute buttons can break temporarily, and the conversation continues.

09

Interview Tips

💡
Lead with the Three Planes
Immediately establish that this system has three fundamentally different planes (control, signaling, media) that scale independently. Most candidates draw one monolithic backend — separating them signals senior-level thinking.
Start with Physics, Not Components
Don't jump to "I'll use Kafka and Redis." Start with: speed of light in fiber ≈ 200K km/s, mouth-to-ear budget = 200ms, Dubai→London = 80ms RTT. This FORCES edge deployment, so let the numbers drive the architecture.
🎯
Nail the SFU Explanation
"SFU forwards encrypted packets without decoding — a smart packet router. Combined with simulcast (3 quality layers), each receiver gets bandwidth-adapted quality. 50x cheaper than MCU, adds 1-5ms latency. The SFU mesh is a real-time CDN."
🔑
The Degradation Ladder Is Your Secret Weapon
Most candidates describe the happy path. Walk through: resolution → frame rate → video off → audio quality → narrowband audio. "Audio always wins" shows you understand real-time system behavior under stress.
🗣️
Use Domain Vocabulary
Say "SFU" not "video server." Say "simulcast layers" not "different quality videos." Say "mouth-to-ear latency" not "delay." Say "cascading" not "server forwarding." Say "TWCC" not "congestion detection." Using terms naturally signals domain experience.
⏱️
Time Budget: 35 Minutes
0-3 min: clarify scope. 3-8: requirements + scale. 8-15: high-level architecture (three planes). 15-20: API design. 20-30: deep dive (SFU internals or congestion control). 30-35: tradeoffs + failures. Don't spend 15 minutes on the control plane — it's a standard web app.
🚫
Things to AVOID
Don't say "We'll just use WebRTC" (it's a browser API, not architecture). Don't use Kafka for video (latency!). Don't suggest TCP for media (head-of-line blocking). Don't suggest CDN caching (nothing to cache). Don't discuss database schemas (nobody cares for this problem).
11

Evolution

How this design grows from a weekend prototype to a planet-scale conferencing platform.

1

MVP — "It Works on My Laptop" (0-1K users)

Single server running SFU + signaling + API + PostgreSQL. Mesh for 1:1, single SFU for groups up to 6. H.264 only, no simulcast, no recording. Open-source mediasoup/Janus. One cloud VM at $50/month.

2

Multi-Server — "Paying Customers" (1K-50K users)

Separate concerns: API, signaling, SFU, TURN as independent services. Deploy in 2-3 cloud regions. Add simulcast (3 layers), VP9, Redis for state, basic recording with CPU-based FFmpeg compositing. Screen sharing, waiting rooms, host controls. $2K-10K/month.

3

Production-Grade — "Enterprise Calling" (50K-5M users)

Expand to 15-20 PoPs with SFU cascading. VP9 SVC, adaptive FEC (FlexFEC), TWCC bandwidth estimation, server-side audio mixing. GPU recording, E2EE (MLS), NATS JetStream, event sourcing. SSO/SAML, compliance recording, live transcription. $50K-200K/month.

4

Scale — "Competing with Zoom" (5M-100M users)

80-100+ PoPs in 3 tiers. Dedicated backbone between Tier 1 hubs. Dynamic per-meeting cascade topology. Meeting mode escalation (interactive → webinar → broadcast). AV1 adoption, content-aware encoding, multi-path transport. AI features: noise suppression, smart chapters, summaries. $1-5M/month.

5

Planet-Scale — "We ARE Zoom" (300M+ users)

150+ PoPs + sovereign deployments (China, EU, Russia). Own backbone (submarine cable partnerships). Custom silicon (FPGA SFU, video encoding ASICs, SmartNICs). AI-native: real-time translation, generative codec research (face landmarks at 5 Kbps), meeting Q&A. Platform SDK ecosystem. $30M+/month.
