06
Deep Dives — 10 Rabbit Holes
Video conferencing is uniquely rich in deep technical domains. Each deep dive covers the single most interesting aspect of its subsystem.
#1 SFU vs MCU vs Mesh — The architectural backbone
#2 WebRTC Signaling — ICE, SDP, NAT traversal
#3 Adaptive Bitrate — GCC, TWCC, congestion control
#4 Audio Pipeline — AEC, mixing, jitter buffers
#5 E2E Encryption — MLS, SFrame, trust models
#6 Global Cascading — PoP placement, backbone
#7 Recording Pipeline — Capture, composite, GPU
#8 Scalable Signaling — Meeting state at 300M
#9 Codec Selection — VP9, AV1, H.264 tradeoffs
#10 Packet Loss Recovery — FEC, NACK, concealment
06.1
SFU vs MCU vs Mesh
The single most important architectural decision in video conferencing. Every other choice cascades from it.
| Factor | Mesh | MCU | SFU |
| Server CPU | None | Extreme | Minimal |
| Client upload | N-1 streams | 1 stream | 1 stream (3 layers) |
| Added latency | ~0ms | 100-200ms | 1-5ms |
| Per-receiver quality | No | No | Yes (simulcast) |
| E2EE possible | Yes | No | Yes |
| Max video participants | 3-4 | 50-100 | 49 video / 1000 audio |
| Cost at scale | $0 | $$$$$$ | $$ |
The SFU never decodes or encodes video; it forwards encrypted RTP packets unchanged. With simulcast (the sender encodes three quality layers: 720p, 360p, and 180p), each receiver gets the layer that best fits its bandwidth. The SFU makes per-subscriber, per-stream forwarding decisions, which is a metadata operation, not a transcoding operation.
Production hybrid: 1:1 calls → Mesh. Small groups → Single SFU. 10-49 → SFU with aggressive simulcast. 50-1000 → SFU + server audio mixing. 1000+ → SFU for panelists + CDN/HLS for audience.
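The tiering above can be read as a simple lookup on participant count. A minimal sketch, assuming that "small groups" means 3-9 participants and that a meeting of exactly 1000 still belongs to the server-mixing tier (the text does not pin either boundary down). `choose_topology` and the tier labels are hypothetical names.

```python
def choose_topology(participants: int) -> str:
    """Map a participant count to one of the hybrid tiers described above."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    if participants == 2:
        return "mesh"                        # 1:1 call, direct peer-to-peer
    if participants < 10:
        return "single-sfu"                  # assumed range for "small groups"
    if participants < 50:
        return "sfu-aggressive-simulcast"    # 10-49
    if participants <= 1000:
        return "sfu-server-audio-mixing"     # 50-1000
    return "sfu-panelists-cdn-audience"      # 1000+


if __name__ == "__main__":
    for n in (2, 6, 25, 300, 5000):
        print(n, choose_topology(n))
```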
06.2
WebRTC Signaling & Connection Establishment
Before a single frame of video flows, there's an elaborate dance: SDP offer/answer for codec negotiation, ICE for NAT traversal (gathering host/srflx/relay candidates, connectivity checks), and DTLS for encryption key exchange.
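How ICE ranks host, srflx, and relay candidates can be made concrete with the priority formula from RFC 8445 (section 5.1.2.1). The sketch below uses the RFC's recommended type preferences; the function name and the default local preference of 65535 (the value suggested for a single-interface host) are illustrative choices.

```python
# Recommended type preferences from RFC 8445: higher is preferred.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str,
                       local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    type_pref = TYPE_PREFERENCE[cand_type]
    return (type_pref << 24) + (local_preference << 8) + (256 - component_id)


if __name__ == "__main__":
    for kind in ("host", "srflx", "relay"):
        print(kind, candidate_priority(kind))
```

Direct paths sort ahead of relayed ones, so a TURN relay is only selected when connectivity checks on the host and srflx candidates fail.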
~85% of users connect via STUN (direct UDP), ~15% need TURN relay (restrictive NATs/firewalls). Total time from click to first media: ~700ms, optimized to ~500ms with pre-gathering, STUN caching, and DTLS session resumption.
Chain of trust: Signaling TLS → SDP fingerprint → DTLS → SRTP keys. The signaling server relays negotiation but never sees media.
06.3
Adaptive Bitrate & Congestion Control
Google Congestion Control (GCC) uses two parallel estimators: a delay-based controller (Kalman filter on inter-arrival time gradients) and a loss-based controller. The system is fast to downgrade (15% reduction) and slow to upgrade (8% probe) — protecting real-time experience over maximizing quality.
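The asymmetric behaviour can be sketched as a single update step. This is a simplification, not the full GCC algorithm: real GCC derives the overuse signal from its delay estimator and scales down relative to the measured receive rate. The signal names, the function name, and the min/max bounds are assumptions made for illustration.

```python
def update_target_bitrate(current_bps: int,
                          signal: str,
                          min_bps: int = 30_000,
                          max_bps: int = 2_500_000) -> int:
    """One step of a fast-down, slow-up rate controller.

    signal is "overuse" (delay or loss detected), "normal", or "hold".
    """
    if signal == "overuse":
        target = current_bps * 85 // 100    # 15% reduction
    elif signal == "normal":
        target = current_bps * 108 // 100   # 8% upward probe
    elif signal == "hold":
        target = current_bps
    else:
        raise ValueError(f"unknown signal: {signal}")
    return max(min_bps, min(max_bps, target))


if __name__ == "__main__":
    rate = 1_000_000
    for s in ("overuse", "normal", "normal", "hold"):
        rate = update_target_bitrate(rate, s)
        print(s, rate)
```

Integer arithmetic is used so the steps are exact. At these rates, recovering from one 15% cut takes a little over two 8% probes.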
TWCC (Transport-Wide Congestion Control) provides per-receiver bandwidth estimates at the SFU, enabling independent simulcast layer selection per subscriber. Alice on fiber gets 720p while Bob on mobile gets 360p — from the same sender.
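Per-subscriber layer selection at the SFU might look like the sketch below. The layer bitrates are assumed values for illustration (the text names the resolutions but not their bitrates), and `select_layer` is a hypothetical helper.

```python
from typing import Optional

# (layer name, assumed bitrate in bits per second), highest quality first.
SIMULCAST_LAYERS = [
    ("720p", 1_500_000),
    ("360p", 500_000),
    ("180p", 150_000),
]


def select_layer(estimate_bps: int) -> Optional[str]:
    """Return the best layer that fits the subscriber's bandwidth estimate.

    Returns None when even the lowest layer does not fit (audio only).
    """
    for name, bitrate in SIMULCAST_LAYERS:
        if bitrate <= estimate_bps:
            return name
    return None


if __name__ == "__main__":
    print("Alice (fiber):", select_layer(20_000_000))
    print("Bob (mobile):", select_layer(800_000))
```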
Degradation Ladder
>2.5 Mbps: 720p30 + 360p30 + 180p30
1.5-2.5 Mbps: 720p15 + 360p30
0.8-1.5 Mbps: 360p30 + 180p15
0.3-0.8 Mbps: 360p15
<0.3 Mbps: Audio only
<80 Kbps: Opus narrowband
Audio ALWAYS wins.
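The ladder translates directly into a threshold lookup. A minimal sketch: the rung contents come from the ladder above, while the function name, the return shape, and the treatment of values that sit exactly on a boundary are assumptions.

```python
def plan_for_bandwidth(bps: int) -> tuple[list[str], str]:
    """Return (video layers to publish, audio mode) for an available bitrate."""
    # Audio is decided first and never dropped: audio always wins.
    audio = "opus-narrowband" if bps < 80_000 else "opus"

    if bps > 2_500_000:
        video = ["720p30", "360p30", "180p30"]
    elif bps >= 1_500_000:
        video = ["720p15", "360p30"]
    elif bps >= 800_000:
        video = ["360p30", "180p15"]
    elif bps >= 300_000:
        video = ["360p15"]
    else:
        video = []  # below 0.3 Mbps: audio only
    return video, audio


if __name__ == "__main__":
    for bps in (3_000_000, 2_000_000, 1_000_000, 500_000, 200_000, 50_000):
        print(bps, plan_for_bandwidth(bps))
```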
06.4
Audio Pipeline — AEC, Mixing, Jitter Buffers
Users tolerate terrible video but abandon calls within 10 seconds of bad audio. The pipeline: Capture → Noise Suppression (neural network, RNNoise-style) → AEC (adaptive NLMS filter covering a 100-300ms echo tail) → AGC → VAD → Opus encode (20ms frames, in-band FEC) → Network → Jitter Buffer (adaptive, 30-60ms) → Decode → Mix → Playback.
AEC is the hardest DSP problem — modeling room acoustics in real-time, handling double-talk detection, non-linear speaker distortion, and variable system latency (Android: 50-200ms, highly variable). Server-side audio mixing for large meetings selects top-3 loudest speakers with hysteresis, creating personalized N-speaker mixes excluding each participant's own audio.
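The server-side speaker selection described above can be sketched as follows. The hysteresis is modelled as a fixed dB bonus for participants who are already in the mix, which is one common way to avoid rapid switching; the 3 dB margin, the dBFS level scale, and both function names are assumptions.

```python
def select_active_speakers(levels_db: dict[str, float],
                           current: list[str],
                           max_speakers: int = 3,
                           hysteresis_db: float = 3.0) -> list[str]:
    """Pick the loudest speakers, favouring those already selected.

    A challenger must be louder than an incumbent by more than
    hysteresis_db to displace them.
    """
    def score(pid: str) -> float:
        bonus = hysteresis_db if pid in current else 0.0
        return levels_db[pid] + bonus

    ranked = sorted(levels_db, key=score, reverse=True)
    return ranked[:max_speakers]


def mix_inputs_for(listener: str, active: list[str]) -> list[str]:
    """Personalized mix: every active speaker except the listener."""
    return [pid for pid in active if pid != listener]


if __name__ == "__main__":
    levels = {"a": -20.0, "b": -22.0, "c": -25.0, "d": -24.0}
    active = select_active_speakers(levels, current=["a", "b", "c"])
    print(active)                      # d is only 1 dB louder than c
    print(mix_inputs_for("a", active))
```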
06.5
End-to-End Encryption
Double encryption: SFrame encrypts the payload with a meeting key (SFU can't decrypt), SRTP encrypts the transport. Key exchange via MLS protocol (RFC 9420) using a ratchet tree — O(log N) rekeying on participant join/leave vs O(N²) for sender keys.
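The scaling gap between the two schemes is easy to see with rough operation counts. This is a back-of-the-envelope model, not an MLS implementation: it counts one update per tree level for a balanced binary ratchet tree, and for sender keys it assumes each of the N members sends a fresh key to each of the other N-1 members. Both function names are hypothetical.

```python
def ratchet_tree_rekey_cost(members: int) -> int:
    """Nodes updated on one leaf-to-root path of a balanced binary tree."""
    if members < 1:
        raise ValueError("need at least one member")
    return (members - 1).bit_length()   # equals ceil(log2(members))


def sender_keys_rekey_cost(members: int) -> int:
    """Pairwise key deliveries when every member redistributes its key."""
    if members < 1:
        raise ValueError("need at least one member")
    return members * (members - 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, ratchet_tree_rekey_cost(n), sender_keys_rekey_cost(n))
```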
E2EE disables: server-side recording, live transcription, server audio mixing, PSTN dial-in, and compliance monitoring. This is why it's opt-in, not default — the industry consensus across Zoom, Teams, and Meet.
06.6
Global Media Server Placement & Cascading
80-150+ PoPs globally in three tiers: 15 mega-PoPs (cascade hubs), 40 regional, 50+ micro-PoPs at ISP peering points. Each unique stream crosses any inter-PoP link exactly once, regardless of subscriber count on each side.
Cascade topology is dynamic and per-meeting: direct link for 2 PoPs, star for 3-5, minimum latency spanning tree for 6+. Dedicated backbone between Tier 1 PoPs delivers consistent 70-85ms RTT with <0.1% loss, vs public internet's variable 80-140ms with 0.5-3% loss.
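The per-meeting topology choice can be sketched as below. The direct/star/spanning-tree thresholds follow the text; choosing the star hub as the PoP with the lowest total RTT to the others is an assumption (the text does not say how the hub is picked), and the spanning tree uses Prim's algorithm. The RTT table keyed by `frozenset` pairs and the function names are illustrative.

```python
RttTable = dict[frozenset, float]


def _rtt(table: RttTable, a: str, b: str) -> float:
    return table[frozenset((a, b))]


def cascade_links(pops: list[str], table: RttTable) -> list[tuple[str, str]]:
    """Return the inter-PoP links for one meeting."""
    if len(pops) < 2:
        return []
    if len(pops) == 2:
        return [(pops[0], pops[1])]                      # direct link
    if len(pops) <= 5:
        # Star: hub is the PoP with the smallest summed RTT (assumed rule).
        hub = min(pops, key=lambda h: sum(_rtt(table, h, p)
                                          for p in pops if p != h))
        return [(hub, p) for p in pops if p != hub]
    # 6+ PoPs: minimum-latency spanning tree via Prim's algorithm.
    in_tree = {pops[0]}
    links: list[tuple[str, str]] = []
    while len(in_tree) < len(pops):
        a, b = min(((x, y) for x in in_tree for y in pops if y not in in_tree),
                   key=lambda edge: _rtt(table, edge[0], edge[1]))
        links.append((a, b))
        in_tree.add(b)
    return links


if __name__ == "__main__":
    table = {frozenset(("fra", "iad")): 85.0,
             frozenset(("fra", "sin")): 160.0,
             frozenset(("iad", "sin")): 220.0}
    print(cascade_links(["fra", "iad", "sin"], table))
```

Every topology returned here is a tree, so there is exactly one path between any two PoPs and a stream never crosses the same link twice.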
06.7
Recording & Compositing Pipeline
Separate capture from compositing. The recording agent writes raw encoded tracks to S3 during the meeting (just file I/O, ~0.1 CPU cores per meeting). Post-meeting, GPU workers composite them into a gallery or speaker view MP4: decode all tracks, synchronize them via RTP timestamps, compute the layout, composite, and re-encode with H.264+AAC.
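Placing frames on a timeline from RTP timestamps requires handling the 32-bit wraparound and the per-media clock rate (90 kHz for video payloads, 48 kHz for Opus). The sketch below computes elapsed time within a single track; aligning different tracks to a shared wall clock typically also needs the RTP-to-NTP mapping from RTCP sender reports, which is left out here. The function name is hypothetical.

```python
RTP_TIMESTAMP_MODULUS = 1 << 32   # RTP timestamps are 32-bit and wrap around

VIDEO_CLOCK_HZ = 90_000
OPUS_CLOCK_HZ = 48_000


def rtp_elapsed_seconds(first_ts: int, ts: int, clock_rate_hz: int) -> float:
    """Seconds between the first packet of a track and a later packet.

    Assumes less than one full wrap (about 13 hours at 90 kHz) has elapsed.
    """
    ticks = (ts - first_ts) % RTP_TIMESTAMP_MODULUS
    return ticks / clock_rate_hz


if __name__ == "__main__":
    # One second of video, with the timestamp wrapping in between.
    start = RTP_TIMESTAMP_MODULUS - 45_000
    print(rtp_elapsed_seconds(start, 45_000, VIDEO_CLOCK_HZ))
    # One 20 ms Opus frame is 960 ticks at 48 kHz.
    print(rtp_elapsed_seconds(1_000, 1_960, OPUS_CLOCK_HZ))
```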
GPU compositing: ~2 minutes per hour of meeting. CPU-only: ~30 minutes. GPU is non-negotiable at scale. Lazy generation: only gallery view by default, speaker view and individual tracks on demand — reduces GPU usage by ~60%.
06.8
Scalable Signaling — Meeting State at 300M Users
Meeting state lives in Redis (hot) + PostgreSQL (cold). Mutations use optimistic concurrency with epoch numbers. Cross-server broadcast via Redis Pub/Sub (same region) + NATS JetStream (cross-region). Meeting-affine routing places all small-meeting participants on the same signaling server, eliminating pub/sub overhead.
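Optimistic concurrency with epoch numbers amounts to a compare-and-set on the meeting's epoch. The sketch below uses an in-memory class as a stand-in for the Redis-backed store; `MeetingStateStore`, `StaleEpochError`, and the key names are hypothetical, and a real deployment would perform the check and the write atomically on the server.

```python
class StaleEpochError(Exception):
    """Raised when a writer's view of the meeting state is out of date."""


class MeetingStateStore:
    """In-memory stand-in for the hot meeting-state store."""

    def __init__(self) -> None:
        self._epoch = 0
        self._state: dict[str, object] = {}

    def read(self) -> tuple[int, dict[str, object]]:
        """Return the current epoch and a copy of the state."""
        return self._epoch, dict(self._state)

    def apply(self, expected_epoch: int, changes: dict[str, object]) -> int:
        """Apply changes only if nobody else has written since the read."""
        if expected_epoch != self._epoch:
            raise StaleEpochError(
                f"expected epoch {expected_epoch}, store is at {self._epoch}")
        self._state.update(changes)
        self._epoch += 1
        return self._epoch


if __name__ == "__main__":
    store = MeetingStateStore()
    epoch, _ = store.read()
    store.apply(epoch, {"muted:alice": True})
    try:
        store.apply(epoch, {"muted:bob": True})   # stale: epoch has moved on
    except StaleEpochError as err:
        print("retry needed:", err)
```

A writer that hits `StaleEpochError` re-reads the state, re-evaluates its mutation against the new epoch, and retries.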
Meeting state is single-leader in the host's region, so all mutations route there. Non-home participants accept ~80-120ms extra latency on state changes, which fits within the 500ms signaling budget. Event sourcing provides a full audit trail, state reconstruction, analytics, and compliance.
06.9
Codec Selection — VP8/VP9/AV1/H.264
VP9 is the current default (30% better compression than H.264, native SVC). H.264 is the mandatory fallback: Safari/iOS, about 25% of users, has historically supported only H.264 for WebRTC. AV1 is the future (60-80% better than H.264), but hardware encode isn't universal yet.
Mixed-codec meetings handled by dual-publish (VP9 sender also sends H.264) or codec unification. Audio is settled: Opus won — royalty-free, better than every competitor at every bitrate, seamless speech/music switching, built-in FEC.
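Both strategies for mixed-codec meetings can be sketched from a map of each participant's supported codecs. The sketch assumes every client supports H.264 (the mandatory fallback) and prefers VP9 when available; the function names and the capability map are hypothetical.

```python
PREFERENCE = ["VP9", "H264"]   # best first; H264 is the universal fallback
FALLBACK = "H264"


def unified_codec(capabilities: dict[str, set[str]]) -> str:
    """Codec unification: the best codec that every participant supports."""
    common = set.intersection(*capabilities.values())
    for codec in PREFERENCE:
        if codec in common:
            return codec
    raise ValueError("participants share no codec")


def publish_codecs(sender: str, capabilities: dict[str, set[str]]) -> list[str]:
    """Dual-publish: send the sender's best codec, plus H264 if needed."""
    best = next(c for c in PREFERENCE if c in capabilities[sender])
    if best == FALLBACK:
        return [best]
    receivers = [caps for pid, caps in capabilities.items() if pid != sender]
    if all(best in caps for caps in receivers):
        return [best]
    return [best, FALLBACK]


if __name__ == "__main__":
    caps = {"alice": {"VP9", "H264"}, "bob": {"H264"}, "carol": {"VP9", "H264"}}
    print(unified_codec(caps))            # everyone can decode this
    print(publish_codecs("alice", caps))  # bob cannot decode VP9
```

Unification costs quality for everyone when one client lacks VP9, while dual-publish costs the sender extra encode work and upload.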
06.10
Last-Mile Quality — FEC, NACK, Packet Loss Recovery
WiFi is the primary villain (contention, interference, bufferbloat). Three recovery strategies in priority cascade: FEC (FlexFEC, Reed-Solomon codes, adaptive rate, 2D interleaving for burst loss) → NACK retransmission (RTX stream, only when RTT < jitter buffer depth × 0.6) → Concealment (frame freeze, motion compensation, Opus PLC).
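The priority cascade reduces to a short decision function. The 0.6 factor comes from the rule above: a retransmission is only worth requesting if it can arrive before the jitter buffer plays past the gap. The function names and the string labels are illustrative.

```python
NACK_RTT_FACTOR = 0.6   # NACK only when RTT < jitter buffer depth * 0.6


def nack_allowed(rtt_ms: float, jitter_buffer_ms: float) -> bool:
    """True if a retransmitted packet can plausibly arrive in time."""
    return rtt_ms < jitter_buffer_ms * NACK_RTT_FACTOR


def recovery_strategy(fec_recovered: bool,
                      rtt_ms: float,
                      jitter_buffer_ms: float) -> str:
    """Choose how to handle one lost packet, in priority order."""
    if fec_recovered:
        return "fec"            # rebuilt locally, no extra round trip
    if nack_allowed(rtt_ms, jitter_buffer_ms):
        return "nack"           # ask the sender to retransmit on the RTX stream
    return "conceal"            # too late to retransmit: hide the loss


if __name__ == "__main__":
    print(recovery_strategy(True, 120.0, 60.0))
    print(recovery_strategy(False, 30.0, 60.0))
    print(recovery_strategy(False, 120.0, 60.0))
```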
Keyframes get 2-3x more FEC redundancy. Audio has triple protection: Opus in-band FEC + RFC 2198 redundancy + FlexFEC. Audio survives up to 20-30% packet loss with barely perceptible degradation. Always reduce video bitrate BEFORE adding FEC to avoid the FEC death spiral, where redundancy added to an already congested link causes more loss, which in turn triggers more redundancy.