Design PUBG

01

Problem Statement

Design the backend infrastructure for a battle royale game at PUBG scale. 100 players connect to a single game server, drop onto an 8km × 8km map, collect weapons and equipment, and fight in real-time combat while a shrinking zone forces encounters. The last player or squad standing wins.

Unlike traditional web systems, this is a real-time simulation problem. The server must process inputs, simulate physics, detect hits, and broadcast state to all players every 50 milliseconds (at 20 Hz tick rate). Latency is measured in frames, not seconds.

Core question: How do you synchronize the real-time state of 100 players, thousands of items, projectiles, and a shrinking zone across unreliable networks with sub-100ms perceived latency?

The system has two distinct halves: a platform layer (matchmaking, profiles, inventory — standard microservices) and a game server layer (real-time simulation — ephemeral, stateful, latency-critical). The architecture challenge is the bridge between them.

02

Requirements

Functional Requirements

Matchmaking — Group ~100 players of similar skill and region into a lobby, start when full
Game world — Large map with loot spawns, vehicles, destructible objects
Real-time combat — Player positions, shooting, hit detection, health at 20–60 Hz
Shrinking zone — Timed circle that damages players outside it, forcing encounters
Inventory system — Pick up, drop, equip weapons, attachments, and consumables
Squad support — Solo, duo, and squad (4-player) modes with voice chat
Match results — Kill feed, final standings, persistent stats (K/D, wins, rank)
Spectating — Watch after death, anti-cheat replay data collection

Non-Functional Requirements

Low latency — Server tick rate of 20–60 Hz, perceived client latency < 100ms
Consistency — Authoritative server for all critical game actions; no desync on hit registration
Scalability — Support millions of concurrent players across thousands of simultaneous matches
Availability — Matchmaking and platform services highly available; individual match servers are ephemeral
Anti-cheat — Server-side validation of all game actions, client-side kernel driver, post-match ML analysis

03

Scale Estimation

All numbers are derived from stated assumptions rather than pulled from thin air, starting from 50M MAU (PUBG's peak scale).

Metric	Value
Peak CCU	1.5M
Concurrent Matches	12,000
Matches / Day	750K
Aggregate Bandwidth	~115 Gbps
Game Servers at Peak	~3,000
Bandwidth per Match	~10 Mbps
Matchmaking Req (Peak)	~4K/s
Player Profile Storage	~100 GB

Derivation

50M MAU → 15M DAU (30% daily active) → 1.5M CCU (10% concurrent at peak). 80% in-match = 1.2M players in matches. At 100 per match = 12,000 concurrent matches. Each player plays ~5 matches/day × 15M ÷ 100 = 750K matches/day.

Bandwidth: ~20 entities in AOI × 30 bytes/entity is roughly 600–700 bytes per uncompressed tick, sent 20 times per second. With delta compression, budget ~12 KB/s per player bidirectional. 100 players × 12 KB/s = 1.2 MB/s per match (~10 Mbps). 12K matches × ~10 Mbps = ~115 Gbps aggregate (spread across ~3K servers).

Compute: Each match fits on 2–4 CPU cores. At 4–8 matches per bare-metal node, 12,000 concurrent matches need 1,500–3,000 game server nodes at peak. Match instances are ephemeral (~25 min lifecycle), so 12,000 ÷ 25 min means ~480 instances are recycled per minute.
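The derivation above can be reproduced as a short back-of-envelope script. This is only a sketch: every constant is one of the assumptions stated in the text, and the variable names are illustrative.

```python
# Back-of-envelope capacity model; every constant is an assumption from the text.
MAU = 50_000_000
DAU = MAU * 30 // 100            # 30% daily active
PEAK_CCU = DAU * 10 // 100       # 10% of DAU online at peak
IN_MATCH = PEAK_CCU * 80 // 100  # 80% of online players are in a match
PLAYERS_PER_MATCH = 100

concurrent_matches = IN_MATCH // PLAYERS_PER_MATCH
matches_per_day = DAU * 5 // PLAYERS_PER_MATCH   # ~5 matches per player per day

KB_PER_SEC_PER_PLAYER = 12                       # bidirectional, delta-compressed
match_mbps = PLAYERS_PER_MATCH * KB_PER_SEC_PER_PLAYER * 8 / 1000
aggregate_gbps = concurrent_matches * match_mbps / 1000

nodes_min = concurrent_matches // 8              # 8 matches per node
nodes_max = concurrent_matches // 4              # 4 matches per node
recycled_per_min = concurrent_matches / 25       # ~25 min match lifecycle

print(concurrent_matches, matches_per_day, round(aggregate_gbps, 1),
      nodes_min, nodes_max, recycled_per_min)
```

Changing any single assumption (for example the 80% in-match ratio) shows quickly which downstream numbers are sensitive to it.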

Key insight: Network is the bottleneck, not storage. 115 Gbps aggregate means game servers must be regionally distributed, close to players. Player profile DB is only ~100 GB — trivial.

04

API Design

APIs split into two categories: Platform APIs (REST/gRPC for matchmaking, profiles, inventory) and the Game Protocol (custom binary UDP for real-time state sync).

Platform APIs (REST over HTTPS)

Matchmaking — Queue

POST /api/v1/matchmaking/queue
Authorization: Bearer <session_token>

{ "mode": "squad", "region": "ap-south-1", "map": "erangel", "perspective": "tpp" }

→ 202 Accepted
{ "ticket_id": "tkt_a8f3...", "estimated_wait_sec": 12, "status": "queued" }

Matchmaking — Poll for Match

GET /api/v1/matchmaking/ticket/{ticket_id}

→ 200 OK (when matched)
{
  "status": "matched",
  "match_id": "match_7bc2...",
  "game_server": {
    "host": "gs-ap-south-42.pubg.internal",
    "port": 7777, "udp_port": 7778,
    "token": "eyJhbG..."
  },
  "lobby_players": 97, "map": "erangel"
}

Player Profile & Stats

GET /api/v1/players/{player_id}/profile

→ 200 OK
{
  "player_id": "p_9d3f...", "username": "ShroudLite", "level": 47,
  "rank": { "tier": "platinum", "division": 2, "rating": 1847 },
  "lifetime_stats": {
    "matches_played": 3241, "wins": 186, "kills": 9720,
    "kd_ratio": 3.42, "avg_damage": 312.5
  }
}

Game Server Protocol (Real-Time, UDP)

Once matched, everything switches to a custom binary UDP protocol. REST doesn't work here — TCP's head-of-line blocking adds 100–300ms spikes on packet loss, and the server must broadcast state every 50ms.

Client → Server (Upstream)

PLAYER_INPUT (16 bytes, every frame): sequence number, timestamp, WASD direction, look angles, action bitfield (jump/crouch/fire/ADS/reload), weapon slot.

INTERACT (10 bytes, on action): pick up item, open door, enter vehicle. Includes target object ID.

VOICE_FRAME (~80 bytes, 50/sec while talking): Opus-encoded 20ms audio frame for squad voice chat.
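To make the 16-byte PLAYER_INPUT budget concrete, here is a minimal packing sketch. The field order, field widths, and angle quantization are assumptions chosen to fit 16 bytes; the text only lists which fields exist, not their wire layout.

```python
import struct

# Hypothetical little-endian layout, 16 bytes total:
# seq:u32, timestamp_ms:u32, yaw:u16, pitch:i16, actions:u16, move:u8, weapon_slot:u8
PLAYER_INPUT = struct.Struct("<IIHhHBB")

MOVE_W, MOVE_A, MOVE_S, MOVE_D = 1, 2, 4, 8
ACT_JUMP, ACT_CROUCH, ACT_FIRE, ACT_ADS, ACT_RELOAD = 1, 2, 4, 8, 16


def encode_input(seq, timestamp_ms, yaw_deg, pitch_deg, actions, move, weapon_slot):
    """Quantize look angles to 16 bits each and pack the input into 16 bytes."""
    yaw_q = int((yaw_deg % 360.0) / 360.0 * 65536) & 0xFFFF
    pitch_q = round(max(-90.0, min(90.0, pitch_deg)) / 90.0 * 32767)
    return PLAYER_INPUT.pack(seq & 0xFFFFFFFF, timestamp_ms & 0xFFFFFFFF,
                             yaw_q, pitch_q, actions, move, weapon_slot)


def decode_input(payload):
    """Inverse of encode_input; angles come back with small quantization error."""
    seq, ts, yaw_q, pitch_q, actions, move, slot = PLAYER_INPUT.unpack(payload)
    return {
        "seq": seq, "timestamp_ms": ts,
        "yaw_deg": yaw_q * 360.0 / 65536,
        "pitch_deg": pitch_q * 90.0 / 32767,
        "actions": actions, "move": move, "weapon_slot": slot,
    }


packet = encode_input(47, 123456, 90.0, -45.0, ACT_FIRE | ACT_ADS, MOVE_W | MOVE_D, 1)
decoded = decode_input(packet)
```

Bitfields for movement and actions keep the packet fixed-size, which makes parsing on the server a single unpack call per input.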

Server → Client (Downstream)

WORLD_STATE_DELTA (~400–600 bytes, every tick): server tick number, last ACK'd input sequence, array of nearby player states (position, rotation, velocity, animation, health, weapon). Delta-compressed.

HIT_CONFIRM (6 bytes, reliable): damage dealt, hit zone, target health, kill flag.

ZONE_UPDATE (20 bytes, reliable): new circle center, radius, shrink speed, damage per second.

Reliability split: Position updates are unreliable (fire-and-forget — stale data is worse than lost data). Zone updates, hit confirms, and kill events are reliable-ordered (ACKs + retransmission). Libraries like ENet or Valve's GameNetworkingSockets provide both channels over a single UDP socket.

Layer	Protocol	Auth	Peak Rate
Matchmaking	REST / HTTPS	JWT (session)	~4K req/sec
Profile / Stats	REST / HTTPS	JWT	~50K req/sec (cached)
Inventory / Shop	REST / HTTPS	JWT	~10K req/sec
Social / Squad	REST + WebSocket	JWT	~5K req/sec
Game state sync	Custom binary UDP	Match token	20–60 ticks/sec × 100
Voice chat	UDP (Opus frames)	Match token	50 frames/sec

05

High-Level Architecture

Two parallel worlds: an always-on platform layer (standard microservices) and an ephemeral game server layer (stateful simulation instances that live for one match). The bridge between them is the Game Server Orchestrator.

Regional Architecture

Each of 5–6 major regions is nearly self-contained: its own matchmaker, game server fleet, orchestrator, database replicas, and Kafka cluster. Players are matched within their region for latency. Only account/auth, shop/payments, and global leaderboards are truly global services synced asynchronously across regions.

Game Server Orchestrator

The most distinctive component, with no counterpart in typical web systems. It maintains a warm pool of pre-provisioned game server containers. When the matchmaker fills a lobby, it requests a server from the pool (< 2 seconds), avoiding cold-start delays. Pre-warming uses predictive scaling based on time-of-day and historical demand. At peak, the orchestrator recycles ~480 servers per minute.
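The warm-pool idea can be sketched in a few lines. This is an in-memory toy under stated assumptions: `WarmPool`, `GameServer`, and `target_size` are hypothetical names, and a real orchestrator would boot containers and derive the target size from a demand forecast rather than a constant.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class GameServer:
    server_id: int
    match_id: Optional[str] = None


class WarmPool:
    """Keeps idle, already-booted servers ready so allocation is instant."""

    def __init__(self, target_size):
        self._ids = itertools.count(1)
        self._idle = deque()
        self.target_size = target_size
        self.replenish()

    def replenish(self):
        """Boot servers until the idle pool is back at its target size."""
        started = 0
        while len(self._idle) < self.target_size:
            self._idle.append(GameServer(next(self._ids)))
            started += 1
        return started

    def allocate(self, match_id):
        """Hand an idle server to a match; None means the pool ran dry."""
        if not self._idle:
            return None
        server = self._idle.popleft()
        server.match_id = match_id
        return server


pool = WarmPool(target_size=3)
first = pool.allocate("match_1")
```

A `None` from `allocate` is the signal that predictive scaling under-provisioned and the matchmaker must wait for a cold start.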

Request Flow — Step Through

Client→Auth Service→Matchmaker→Orchestrator→Game Server→Tick Loop (×600)→Kafka→Match Processor


06

Deep Dive — Real-Time Game Networking

Everything else in this design is standard distributed systems. The hard, novel problem is keeping 100 players in sync on an 8km map with sub-100ms perceived latency over unreliable networks. Four interlocking techniques make this work.

The fundamental problem: The server and every client are always looking at different moments in time because of network latency. A player in Mumbai is ~30ms from the server; add interpolation delay and they see the world ~80ms in the past. The netcode's job is to create the illusion of a shared present moment.

1. Client-Side Prediction (runs on the client)

Your inputs are applied locally and instantly — the client runs the same physics engine as the server. You don't wait for the server to confirm movement. The client "predicts" the result. This is why your own movement always feels responsive.

2. Server Reconciliation (runs on the client)

When the server's authoritative state arrives, the client checks if its predictions were correct. If not, it re-simulates all unconfirmed inputs on top of the server's position. Small errors are blended smoothly; large errors cause a visible "rubber-band" snap.

3. Entity Interpolation (runs on the client)

Other players' positions arrive at 20 Hz but must render at 60–144 FPS. The client renders them one tick behind (~50ms in the past), smoothly interpolating between the two most recent snapshots. This is why enemies look smooth even at low tick rates.

4. Lag Compensation (runs on the server)

When you fire, you're aiming at a target ~80ms in the past. The server rewinds hitboxes to the time you saw the world and checks your shot there. This makes shooting feel accurate, but means targets can occasionally die "behind cover."
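Entity interpolation, the third technique above, reduces to a linear blend between two buffered snapshots. The sketch below assumes snapshots are `(time_s, (x, y, z))` tuples sorted by time; the function name and the clamping behaviour outside the buffered range are illustrative choices.

```python
INTERP_DELAY_S = 0.050  # render one 20 Hz tick in the past


def interpolate_position(snapshots, render_time):
    """Blend linearly between the two snapshots that bracket render_time."""
    for (t0, p0), (t1, p1) in zip(snapshots, snapshots[1:]):
        if t0 <= render_time <= t1:
            alpha = (render_time - t0) / (t1 - t0)
            return tuple(a + (b - a) * alpha for a, b in zip(p0, p1))
    # Outside the buffered range: clamp to the nearest snapshot.
    if render_time < snapshots[0][0]:
        return snapshots[0][1]
    return snapshots[-1][1]


snapshots = [
    (0.00, (0.0, 0.0, 0.0)),
    (0.05, (1.0, 0.0, 0.0)),
    (0.10, (2.0, 0.0, 2.0)),
]
client_now = 0.125
rendered = interpolate_position(snapshots, client_now - INTERP_DELAY_S)
```

Rendering at `client_now - INTERP_DELAY_S` is what guarantees two snapshots are usually available to blend between, at the cost of showing other players slightly in the past.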

Client-Side Prediction — How It Works

Without prediction, pressing W results in 83ms of nothing (30ms to server + 3ms processing + 50ms back), then a sudden teleport. At 60 FPS, that's 5 frames of your character ignoring you. Unplayable.

With prediction: the client immediately applies your input to a local physics simulation — the same code the server runs. It stores each predicted result in a prediction buffer (ring buffer, ~128 entries = 2 seconds of inputs). When the server confirms input #N, the client discards entries up to N and re-simulates everything after N on top of the server's authoritative position.

The re-simulation typically processes 4–14 frames and takes < 0.1ms — invisible in the frame budget. When prediction matches (99% of the time), the player sees nothing. When it doesn't, the correction is blended over 50–200ms depending on error magnitude.
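A minimal sketch of prediction plus reconciliation follows. It simplifies the description above in one respect, which is an assumption of this example: the buffer stores pending inputs rather than predicted results, and 2D movement with a constant `SPEED` stands in for the real physics step.

```python
from collections import deque

SPEED = 5.0      # metres per second (hypothetical)
DT = 1.0 / 60.0  # fixed input step


def step(pos, move):
    """Shared movement code: the same function the server would run."""
    return (pos[0] + move[0] * SPEED * DT, pos[1] + move[1] * SPEED * DT)


class PredictedPlayer:
    def __init__(self, pos=(0.0, 0.0), capacity=128):
        self.pos = pos
        self.next_seq = 0
        self.pending = deque(maxlen=capacity)  # (seq, move) not yet ACK'd

    def apply_input(self, move):
        """Apply the input locally right away and remember it for replay."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, move))
        self.pos = step(self.pos, move)
        return seq

    def reconcile(self, acked_seq, server_pos):
        """Drop inputs the server has processed, replay the rest, return the error."""
        while self.pending and self.pending[0][0] <= acked_seq:
            self.pending.popleft()
        pos = server_pos
        for _, move in self.pending:
            pos = step(pos, move)
        error = ((pos[0] - self.pos[0]) ** 2 + (pos[1] - self.pos[1]) ** 2) ** 0.5
        self.pos = pos
        return error


player = PredictedPlayer()
for _ in range(3):
    player.apply_input((1.0, 0.0))

# Server agrees with the prediction for input 0: no visible correction.
matched_error = player.reconcile(0, step((0.0, 0.0), (1.0, 0.0)))
# Server says input 1 was blocked (say, by a wall): the client is corrected.
blocked_error = player.reconcile(1, (0.0, 0.0))
```

The returned error is the value a client would use to decide between a smooth blend and a hard snap.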

Lag Compensation — The Shot Sequence

sequenceDiagram
    participant A as Player A (Shooter)
    participant S as Game Server
    participant B as Player B (Target)
    Note over A: Sees B at position from 80ms ago
    A->>A: Click fire, muzzle flash plays instantly (prediction)
    A->>S: PLAYER_INPUT {fire, yaw, pitch, seq:47}
    Note over S: Receives after ~30ms
    S->>S: Calculate A's view time: T - 30ms - 50ms = T-80ms
    S->>S: Rewind B's hitbox to T-80ms
    S->>S: Raycast → HIT on B's torso
    S->>S: Apply 38 HP damage (AKM)
    S-->>A: HIT_CONFIRM {damage:38, zone:torso}
    S-->>B: DAMAGE_EVENT {from:A, dmg:38, dir:NW}
    Note over A: ~63ms total, sees hit marker
    Note over B: ~63ms total, takes damage, screen shakes

The server maintains a position history buffer for every player (last 1 second: 20 entries × ~40 bytes = ~800 bytes per player, ~80 KB for all 100 players). On a shot, it interpolates between the two history entries bracketing the shooter's view time, reconstructs hitboxes, and checks the ray. The rewind is capped at 250ms to prevent abuse from players intentionally adding latency.
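The rewind described above might look like the following sketch. Several things here are assumptions for illustration: a sphere stands in for real hitboxes, the shooter's view time is taken as one-way latency plus one tick of interpolation delay, and `PositionHistory`, `shooter_view_time`, and `ray_hits_sphere` are hypothetical names.

```python
import bisect

MAX_REWIND_S = 0.250    # rewind cap from the text
INTERP_DELAY_S = 0.050  # one tick of client interpolation at 20 Hz


class PositionHistory:
    """Per-player record of recent positions, kept for about one second."""

    def __init__(self, max_age_s=1.0):
        self.times = []
        self.positions = []
        self.max_age_s = max_age_s

    def record(self, t, pos):
        self.times.append(t)
        self.positions.append(pos)
        while self.times[0] < t - self.max_age_s:
            self.times.pop(0)
            self.positions.pop(0)

    def at(self, t):
        """Interpolate between the two entries bracketing time t."""
        i = bisect.bisect_right(self.times, t)
        if i == 0:
            return self.positions[0]
        if i == len(self.times):
            return self.positions[-1]
        t0, t1 = self.times[i - 1], self.times[i]
        a = (t - t0) / (t1 - t0)
        p0, p1 = self.positions[i - 1], self.positions[i]
        return tuple(x0 + (x1 - x0) * a for x0, x1 in zip(p0, p1))


def shooter_view_time(server_now, one_way_latency_s):
    """Moment the shooter was looking at, with the rewind capped."""
    rewind = min(one_way_latency_s + INTERP_DELAY_S, MAX_REWIND_S)
    return server_now - rewind


def ray_hits_sphere(origin, direction, center, radius):
    """direction must be unit length; targets behind the shooter never hit."""
    oc = tuple(c - o for c, o in zip(center, origin))
    t = sum(a * b for a, b in zip(oc, direction))
    if t < 0:
        return False
    closest = tuple(o + d * t for o, d in zip(origin, direction))
    dist_sq = sum((c - p) ** 2 for c, p in zip(center, closest))
    return dist_sq <= radius ** 2


history = PositionHistory()
for i in range(21):  # target runs along x at 10 m/s, sampled at 20 Hz
    history.record(i * 0.05, (i * 0.5, 10.0, 0.0))

view_time = shooter_view_time(server_now=1.0, one_way_latency_s=0.030)
rewound = history.at(view_time)
hit_rewound = ray_hits_sphere((9.2, 0.0, 0.0), (0.0, 1.0, 0.0), rewound, 0.3)
hit_current = ray_hits_sphere((9.2, 0.0, 0.0), (0.0, 1.0, 0.0), history.at(1.0), 0.3)
```

In this example the shot lands on the rewound position but would miss the target's current position, which is exactly the "died behind cover" effect.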

The Peeker's Advantage — An Unavoidable Tradeoff

Because of lag compensation, the player who peeks a corner has a ~80ms advantage over the defender. The peeker sees the defender immediately, but the defender sees the peeker ~80ms late. This is a fundamental consequence of physics (speed of light + network latency) — there is no solution that eliminates it without breaking hit registration for the shooter. Every competitive FPS makes this same tradeoff.

The Tick Budget

At 20 Hz, the server has 50ms per tick to process everything:

Phase	Budget	What Happens
Network receive	~2ms	Read all queued UDP packets from 100 clients
Input processing	~3ms	Apply 100 players' inputs to world state
Physics simulation	~8ms	Movement, collision, vehicles, projectiles
Hit detection	~5ms	Lag-compensated raycasts for active shooters
Game logic	~4ms	Zone damage, loot, airdrops, kills/knocks
Anti-cheat	~3ms	Speed checks, fire rate, position sanity
State broadcast	~10ms	Per-player AOI filter + delta compress + send
Headroom	~15ms	Absorbs late-game spikes (30 players in small circle)
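A fixed-rate tick loop that tracks budget overruns could be sketched as below. The `clock` and `sleep` parameters are injected only so the loop can be exercised deterministically; the function name and the overrun counter are assumptions of this example, not a described implementation.

```python
import time

TICK_RATE_HZ = 20
TICK_BUDGET_S = 1.0 / TICK_RATE_HZ  # 50 ms per tick


def run_ticks(phases, num_ticks, clock=time.perf_counter, sleep=time.sleep):
    """Run every phase once per tick and sleep off whatever budget is left."""
    overruns = 0
    next_deadline = clock() + TICK_BUDGET_S
    for tick in range(num_ticks):
        for phase in phases:  # e.g. receive, inputs, physics, hits, broadcast
            phase(tick)
        remaining = next_deadline - clock()
        if remaining > 0:
            sleep(remaining)
        else:
            overruns += 1  # the tick ran long and used up its headroom
        next_deadline += TICK_BUDGET_S
    return overruns


# Deterministic demo with a fake clock: each tick does 10 ms of "work".
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    fake_now[0] += seconds


def light_work(tick):
    fake_now[0] += 0.010


overruns = run_ticks([light_work], 3, clock=fake_clock, sleep=fake_sleep)
```

Advancing the deadline by a fixed step, rather than from the current time, keeps the tick rate from drifting when individual ticks finish early or late.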

Area of Interest (AOI) — Spatial Grid

The 8km × 8km map is divided into 500m × 500m cells (16 × 16 = 256 cells). Each player belongs to one cell. Their AOI = their cell + 8 neighbors (3×3 grid). Only entities within these 9 cells are included in their state update — cutting broadcast from 100 to ~20 players. Cell lookup is O(1) by hashing position to cell index.
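The grid lookup is simple enough to sketch directly. The helper names below are illustrative; positions are assumed to be `(x, y)` in metres with the origin at one map corner.

```python
from collections import defaultdict

MAP_SIZE_M = 8000
CELL_SIZE_M = 500
GRID_DIM = MAP_SIZE_M // CELL_SIZE_M  # 16 cells per side, 256 in total


def cell_of(x, y):
    """Map a world position to its grid cell, clamped to the map edges."""
    cx = min(max(int(x // CELL_SIZE_M), 0), GRID_DIM - 1)
    cy = min(max(int(y // CELL_SIZE_M), 0), GRID_DIM - 1)
    return cx, cy


def build_grid(positions):
    """positions: {player_id: (x, y)} -> {cell: set of player ids}."""
    grid = defaultdict(set)
    for player_id, (x, y) in positions.items():
        grid[cell_of(x, y)].add(player_id)
    return grid


def area_of_interest(grid, x, y):
    """Everyone in the player's own cell plus its up-to-8 neighbours."""
    cx, cy = cell_of(x, y)
    visible = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = cx + dx, cy + dy
            if 0 <= nx < GRID_DIM and 0 <= ny < GRID_DIM:
                visible |= grid.get((nx, ny), set())
    return visible


positions = {"a": (250, 250), "b": (700, 300), "c": (1600, 250), "d": (7900, 7900)}
grid = build_grid(positions)
aoi_a = area_of_interest(grid, *positions["a"])
```

Note that the returned set includes the querying player; a broadcast step would typically drop that id or send its state on a separate path.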

Delta Compression — 60–80% Bandwidth Savings

Instead of sending full state every tick, the server tracks what each client has ACK'd and sends only changed fields. A field bitmask (1 byte) indicates which fields follow. If only position changed: 7 bytes instead of 19 bytes per entity. If nothing changed: 1 byte. Typical savings: ~82%.
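The bitmask scheme can be sketched with Python's `struct` module. The per-field sizes below are assumptions picked so that a full entity is 19 bytes and a position-only update is 7 bytes, matching the figures in the text; the actual field widths are not given.

```python
import struct

# (name, struct format) in bitmask order; payload sizes sum to 18 bytes.
FIELDS = [
    ("position", "<3H"),   # 6 bytes, quantized x/y/z
    ("rotation", "<H"),    # 2 bytes, quantized yaw
    ("velocity", "<3h"),   # 6 bytes
    ("animation", "<B"),   # 1 byte
    ("health", "<B"),      # 1 byte
    ("weapon", "<H"),      # 2 bytes
]


def encode_delta(baseline, current):
    """Send a 1-byte mask, then only the fields that differ from baseline."""
    mask = 0
    body = b""
    for bit, (name, fmt) in enumerate(FIELDS):
        if baseline is None or baseline[name] != current[name]:
            mask |= 1 << bit
            value = current[name]
            values = value if isinstance(value, tuple) else (value,)
            body += struct.pack(fmt, *values)
    return bytes([mask]) + body


def decode_delta(baseline, payload):
    """Apply a delta on top of the last state the client acknowledged."""
    state = dict(baseline) if baseline else {}
    mask, offset = payload[0], 1
    for bit, (name, fmt) in enumerate(FIELDS):
        if mask & (1 << bit):
            values = struct.unpack_from(fmt, payload, offset)
            offset += struct.calcsize(fmt)
            state[name] = values if len(values) > 1 else values[0]
    return state


base = {"position": (100, 200, 300), "rotation": 1000, "velocity": (0, 0, 0),
        "animation": 1, "health": 100, "weapon": 7}
moved = dict(base, position=(101, 200, 300))

full_packet = encode_delta(None, base)      # first contact: everything
move_packet = encode_delta(base, moved)     # only position changed
idle_packet = encode_delta(base, base)      # nothing changed
```

The baseline must be a state the client has actually ACK'd; diffing against an unacknowledged snapshot would corrupt the client's view after a single lost packet.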

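07

Key Design Decisions & Tradeoffs

7. Anti-Cheat: Kernel-Level vs Server-Only

Server-Only Validation

No invasive client software is required, but server-only validation cannot detect wallhacks (the client renders whatever the server sends), and "human-plausible" aimbots slip through. It catches only ~40% of cheats, and the game's reputation suffers.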

08

What Can Go Wrong

🔥 Game Server Crash Mid-Match

Server process crashes at minute 18 — 100 players lose their match. Causes: memory leaks, unhandled edge cases (vehicle + zone boundary + revival simultaneously). Mitigation: Checkpoint state to Redis every 30 sec for crash recovery (complex, rarely implemented). Practically: invest in crash prevention — canary deploys, watchdog processes, memory thresholds that force graceful early match end. Acceptable rate: < 0.05% of matches.

⏳ Matchmaking Starvation at Low Population

4 AM in a small region — only 12 players in the platinum squad TPP queue. Queue time exceeds 5 minutes. Players leave → fewer players → longer queues → death spiral. Mitigation: Progressive bracket widening (30s: merge skill tiers, 60s: merge perspectives, 90s: fill with bots, 120s: start with 60–80 players). Cross-region matching as last resort (120–150ms ping but playable).
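The progressive widening schedule can be expressed as a pure function of queue wait time. The thresholds come from the text; `QueuePolicy`, its field names, and the choice of 60 as the reduced minimum lobby size are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePolicy:
    merge_skill_tiers: bool = False
    merge_perspectives: bool = False
    allow_bots: bool = False
    min_players: int = 100


def policy_for_wait(wait_s):
    """Relax matchmaking constraints step by step as the wait grows."""
    return QueuePolicy(
        merge_skill_tiers=wait_s >= 30,
        merge_perspectives=wait_s >= 60,
        allow_bots=wait_s >= 90,
        min_players=60 if wait_s >= 120 else 100,
    )


fresh = policy_for_wait(0)
after_75s = policy_for_wait(75)
after_2min = policy_for_wait(120)
```

Keeping the schedule as data-free pure logic makes it easy to tune per region or per mode without touching the matchmaker itself.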

🔄 Desync — Client and Server Disagree

Player sees themselves behind a rock; server thinks they're exposed. Causes: floating-point divergence between client and server physics, packet loss bursts, clock drift, physics step mismatch (client at 144 FPS vs server at 20 Hz). Mitigation: Quantize positions to 1cm grid, force full state sync every 5 seconds, use fixed physics timestep on both sides, NTP-like clock synchronization.

🛡️ DDoS on Game Servers

Attacker discovers game server IPs (exposed via UDP connection handshake) and floods a specific match — especially during esports tournaments. Mitigation: Route through UDP relay/proxy (game server IP never exposed, +1–2ms latency), cloud DDoS protection (AWS Shield, Cloudflare Spectrum), kernel-level packet filtering (eBPF/XDP) dropping packets from non-player IPs.

📊 Database Hot Spots — Match-End Storm

When a match ends, 100 player stats must be updated simultaneously. Multiple matches ending together = thousands of writes in a burst. Mitigation: Game servers emit results to Kafka (fire-and-forget, < 1ms). Consumers process at steady rate. Batch all 100 updates into a single SQL transaction. Stat aggregation runs async via Flink, delayed 30–60 seconds.

🎯 Cheater Ruins Competitive Match

Aimbot user climbs to top ranks, ruining every match. Affects 99 players per match directly, destroys game reputation long-term. Mitigation: Three-layer anti-cheat (kernel + server + ML). Phone verification for ranked. Hardware ID bans. When a cheater is banned, retroactively void their last N matches and recalculate affected players' ranks.

09

Interview Tips

💡

Frame the problem as "two systems, not one."
The platform layer (matchmaking, profiles, inventory) is standard microservices — sketch it quickly. The game server layer (real-time UDP simulation) is where the novel engineering is. Say: "I'll spend 70% of our time on the game server because that's where the unique challenges are."

⚡

Let numbers drive architecture.
Don't say "we need lots of game servers." Say: "At 1.5M CCU with 100 players per match, we have 12,000 concurrent matches. Each needs 2–4 cores. That's 1,500–3,000 servers, recycling at 480/min — which tells us we need a fast orchestration layer and a warm pool."

🎯

The networking deep dive is your superpower.
Most candidates say "WebSockets" and stop. Explaining any one of the four netcode techniques (prediction, reconciliation, interpolation, lag compensation) in depth shows real-time systems understanding that most candidates can't match.

🔑

Show the scaling asymmetry.
"Unlike web services where we add stateless instances behind a load balancer, each game server is a stateful island that lives for one match. We can't move a player between servers or merge half-empty matches. This means we need a custom scheduler for ephemeral game instances."

🧠

Have a clear answer for "What's the hardest part?"
"Making shooting feel fair across varying network conditions. Do I check the shot against where the target IS now, or where they WERE when the shooter saw them? There's no perfect answer — it's a fundamental consequence of the speed of light. The engineering is building a rewind system with the right cap to feel as fair as possible for both players."

⚠️

Flag what's different from web systems.
Protocol: UDP, not HTTP. State: ephemeral in-memory, not persistent. Consistency: strong within a match (server-authoritative), eventual for platform services. Failure mode: match is disposable — invest in crash prevention, not crash recovery.

10

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Region, 50 Players

One region, 50-player matches, 10 Hz tick rate, single map, solo mode only, no ranked matchmaking, server-side anti-cheat only, no replay system. Game servers run as plain processes on 10–20 EC2 instances with manual scaling. Capacity: ~5,000 CCU, ~100 concurrent matches. Team: 3–5 engineers. Goal: validate the core gameplay loop — can players connect, move, shoot, loot, die, and win?

2

Launch-Ready — Multi-Region, 100 Players

Scale to 100 players, 20 Hz adaptive tick rate (20→30→60 as players die), 3 regions, squad modes, container-based servers with custom orchestrator and warm pools, Kafka for match results, Redis caching, client-side anti-cheat (EasyAntiCheat), basic cosmetics. Full netcode stack: prediction, reconciliation, interpolation, lag compensation with 200ms cap, delta compression, AOI grid. Capacity: ~200K CCU, ~2,000 concurrent matches. Team: 15–25 engineers.

3

Scale — Millions of Players

5–6 fully independent regions, Kubernetes + Agones orchestration, bot backfill for low-pop queues, ranked seasons with Elo ratings, full cosmetics economy with battle pass, replay system (5% stored), post-match ML anti-cheat, global leaderboards, spectator mode for esports, cross-platform play (PC + console + mobile with per-platform optimizations). Hybrid infra: bare metal baseline + cloud burst. Capacity: ~1.5M CCU, ~12,000 concurrent matches. Team: 50–100 engineers.

4

Planet-Scale — Live Service

Custom bare-metal fleet in 10+ regions with edge compute in ISP PoPs. ML-driven matchmaking (play style, toxicity, connection quality). A/B testing framework for game mechanics. Advanced anti-cheat: behavioral fingerprinting, hardware attestation, legal takedowns. User-generated content (custom modes, map editor). Streaming integration (Twitch drops, anti-stream-sniping). 200+ engineers across multiple studios. The hard problems become organizational (modular codebase for 200 devs), operational (rolling deploys with 2–3 server versions live), and financial ($3–5M/month infrastructure, so every 10% efficiency gain saves roughly $300–500K/month).

