Design PUBG

Design a multiplayer battle royale system where 100 players drop onto a large map, scavenge for equipment, and fight until one player or squad survives — all synchronized in real-time over unreliable networks.

Real-Time Networking · UDP Protocol · Ephemeral Servers · Lag Compensation · Spatial Partitioning
01

Problem Statement

Design the backend infrastructure for a battle royale game at PUBG scale. 100 players connect to a single game server, drop onto an 8km × 8km map, collect weapons and equipment, and fight in real-time combat while a shrinking zone forces encounters. The last player or squad standing wins.

Unlike traditional web systems, this is a real-time simulation problem. The server must process inputs, simulate physics, detect hits, and broadcast state to all players every 50 milliseconds (at 20 Hz tick rate). Latency is measured in frames, not seconds.

Core question: How do you synchronize the real-time state of 100 players, thousands of items, projectiles, and a shrinking zone across unreliable networks with sub-100ms perceived latency?

The system has two distinct halves: a platform layer (matchmaking, profiles, inventory — standard microservices) and a game server layer (real-time simulation — ephemeral, stateful, latency-critical). The architecture challenge is the bridge between them.

02

Requirements

Functional Requirements

  • Matchmaking — Group ~100 players of similar skill and region into a lobby, start when full
  • Game world — Large map with loot spawns, vehicles, destructible objects
  • Real-time combat — Player positions, shooting, hit detection, health at 20–60 Hz
  • Shrinking zone — Timed circle that damages players outside it, forcing encounters
  • Inventory system — Pick up, drop, equip weapons, attachments, and consumables
  • Squad support — Solo, duo, and squad (4-player) modes with voice chat
  • Match results — Kill feed, final standings, persistent stats (K/D, wins, rank)
  • Spectating — Watch after death, anti-cheat replay data collection

Non-Functional Requirements

  • Low latency — Server tick rate of 20–60 Hz, perceived client latency < 100ms
  • Consistency — Authoritative server for all critical game actions; no desync on hit registration
  • Scalability — Support millions of concurrent players across thousands of simultaneous matches
  • Availability — Matchmaking and platform services highly available; individual match servers are ephemeral
  • Anti-cheat — Server-side validation of all game actions, client-side kernel driver, post-match ML analysis
03

Scale Estimation

All numbers are derived from stated assumptions rather than pulled from thin air, starting from 50M MAU (PUBG's peak scale).

  • Peak CCU — 1.5M
  • Concurrent Matches — 12,000
  • Matches / Day — 750K
  • Aggregate Bandwidth — ~115 Gbps
  • Game Servers at Peak — ~3,000
  • Bandwidth per Match — ~10 Mbps
  • Matchmaking Req (Peak) — ~4K/s
  • Player Profile Storage — ~100 GB

Derivation

50M MAU → 15M DAU (30% daily active) → 1.5M CCU (10% concurrent at peak). 80% in-match = 1.2M players in matches. At 100 per match = 12,000 concurrent matches. Each player plays ~5 matches/day × 15M ÷ 100 = 750K matches/day.

Bandwidth: ~20 entities in AOI × 30 bytes/entity ≈ 600 bytes of entity state per tick (roughly 700 bytes once packet headers are counted), sent at 20 ticks/sec. With delta compression: ~12 KB/s per player bidirectional. 100 players × 12 KB/s = 1.2 MB/s per match (9.6 Mbps, rounded to ~10 Mbps). 12K matches × 9.6 Mbps ≈ 115 Gbps aggregate (spread across 3K servers).

Compute: Each match fits on 2–4 CPU cores. 4–8 matches per bare-metal node = 1,500–3,000 game servers at peak. Servers are ephemeral (~25 min lifecycle), recycled at ~480/min.
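The derivation above is easy to re-run when an assumption changes. The sketch below is a minimal back-of-envelope model; the function name `estimate_scale` and the dictionary keys are made up for illustration, and every ratio is one of the assumptions stated in the derivation rather than measured data.

```python
def estimate_scale(mau=50_000_000):
    """Back-of-envelope capacity model; every ratio is an assumption from the derivation."""
    dau = mau * 30 // 100                 # 30% of MAU active daily
    peak_ccu = dau * 10 // 100            # 10% of DAU online at peak
    in_match = peak_ccu * 80 // 100       # 80% of online players are in a match
    matches = in_match // 100             # 100 players per match
    matches_per_day = dau * 5 // 100      # ~5 matches per player per day
    match_mbps = 100 * 12_000 * 8 / 1e6   # 100 players x 12 KB/s, in megabits/s
    return {
        "peak_ccu": peak_ccu,
        "concurrent_matches": matches,
        "matches_per_day": matches_per_day,
        "match_mbps": match_mbps,
        "aggregate_gbps": matches * match_mbps / 1000,
        "servers": (matches // 8, matches // 4),   # 8 or 4 matches per node
        "recycled_per_min": matches / 25,          # ~25 min match lifecycle
    }


if __name__ == "__main__":
    for name, value in estimate_scale().items():
        print(f"{name}: {value}")
```

Halving the in-match share or doubling matches per node changes the server count directly, which is the point of keeping the chain of assumptions explicit.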

Key insight: Network is the bottleneck, not storage. 115 Gbps aggregate means game servers must be regionally distributed, close to players. Player profile DB is only ~100 GB — trivial.

04

API Design

APIs split into two categories: Platform APIs (REST/gRPC for matchmaking, profiles, inventory) and the Game Protocol (custom binary UDP for real-time state sync).

Platform APIs (REST over HTTPS)

Matchmaking — Queue
POST /api/v1/matchmaking/queue
Authorization: Bearer <session_token>

{ "mode": "squad", "region": "ap-south-1", "map": "erangel", "perspective": "tpp" }

→ 202 Accepted
{ "ticket_id": "tkt_a8f3...", "estimated_wait_sec": 12, "status": "queued" }

Matchmaking — Poll for Match
GET /api/v1/matchmaking/ticket/{ticket_id}

→ 200 OK (when matched)
{
  "status": "matched",
  "match_id": "match_7bc2...",
  "game_server": {
    "host": "gs-ap-south-42.pubg.internal",
    "port": 7777, "udp_port": 7778,
    "token": "eyJhbG..."
  },
  "lobby_players": 97, "map": "erangel"
}

Player Profile & Stats
GET /api/v1/players/{player_id}/profile

→ 200 OK
{
  "player_id": "p_9d3f...", "username": "ShroudLite", "level": 47,
  "rank": { "tier": "platinum", "division": 2, "rating": 1847 },
  "lifetime_stats": {
    "matches_played": 3241, "wins": 186, "kills": 9720,
    "kd_ratio": 3.42, "avg_damage": 312.5
  }
}

Game Server Protocol (Real-Time, UDP)

Once matched, everything switches to a custom binary UDP protocol. REST doesn't work here — TCP's head-of-line blocking adds 100–300ms spikes on packet loss, and the server must broadcast state every 50ms.

Client → Server (Upstream)

PLAYER_INPUT (16 bytes, every frame): sequence number, timestamp, WASD direction, look angles, action bitfield (jump/crouch/fire/ADS/reload), weapon slot.

INTERACT (10 bytes, on action): pick up item, open door, enter vehicle. Includes target object ID.

VOICE_FRAME (~80 bytes, 50/sec while talking): Opus-encoded 20ms audio frame for squad voice chat.
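To make the 16-byte PLAYER_INPUT message concrete, here is a minimal sketch of one possible wire layout using Python's `struct` module. The field widths, the quantization of the look angles, the action bit positions, and the names `encode_input` and `decode_input` are all assumptions chosen to add up to 16 bytes; the text only lists which fields the message carries.

```python
import struct

# Hypothetical little-endian layout (16 bytes total), not a documented format:
# seq u16 | timestamp_ms u32 | move_x i8 | move_y i8 | yaw u16 | pitch i16
# | actions u16 | weapon_slot u8 | 1 pad byte
PLAYER_INPUT = struct.Struct("<HIbbHhHBx")

# Assumed bit positions for the action bitfield.
ACTION_JUMP, ACTION_CROUCH, ACTION_FIRE, ACTION_ADS, ACTION_RELOAD = (
    1 << i for i in range(5)
)


def encode_input(seq, timestamp_ms, move_x, move_y, yaw_deg, pitch_deg,
                 actions, weapon_slot):
    """Pack one input sample; angles are quantized to 16 bits each."""
    yaw_q = int(round((yaw_deg % 360.0) / 360.0 * 65536)) % 65536
    pitch_q = int(round(max(-90.0, min(90.0, pitch_deg)) / 90.0 * 32767))
    return PLAYER_INPUT.pack(seq & 0xFFFF, timestamp_ms & 0xFFFFFFFF,
                             move_x, move_y, yaw_q, pitch_q,
                             actions, weapon_slot)


def decode_input(data):
    """Unpack a 16-byte PLAYER_INPUT message back into a dict."""
    seq, ts, mx, my, yaw_q, pitch_q, actions, slot = PLAYER_INPUT.unpack(data)
    return {
        "seq": seq, "timestamp_ms": ts, "move": (mx, my),
        "yaw_deg": yaw_q * 360.0 / 65536,
        "pitch_deg": pitch_q * 90.0 / 32767,
        "actions": actions, "weapon_slot": slot,
    }
```

A 16-bit sequence number wraps after about 18 minutes at 60 inputs per second, so a real protocol would need wraparound-aware comparison; this sketch ignores that.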

Server → Client (Downstream)

WORLD_STATE_DELTA (~400–600 bytes, every tick): server tick number, last ACK'd input sequence, array of nearby player states (position, rotation, velocity, animation, health, weapon). Delta-compressed.

HIT_CONFIRM (6 bytes, reliable): damage dealt, hit zone, target health, kill flag.

ZONE_UPDATE (20 bytes, reliable): new circle center, radius, shrink speed, damage per second.

Reliability split: Position updates are unreliable (fire-and-forget — stale data is worse than lost data). Zone updates, hit confirms, and kill events are reliable-ordered (ACKs + retransmission). Libraries like ENet or Valve's GameNetworkingSockets provide both channels over a single UDP socket.

Layer | Protocol | Auth | Peak Rate
Matchmaking | REST / HTTPS | JWT (session) | ~4K req/sec
Profile / Stats | REST / HTTPS | JWT | ~50K req/sec (cached)
Inventory / Shop | REST / HTTPS | JWT | ~10K req/sec
Social / Squad | REST + WebSocket | JWT | ~5K req/sec
Game state sync | Custom binary UDP | Match token | 20–60 ticks/sec × 100
Voice chat | UDP (Opus frames) | Match token | 50 frames/sec
05

High-Level Architecture

Two parallel worlds: an always-on platform layer (standard microservices) and an ephemeral game server layer (stateful simulation instances that live for one match). The bridge between them is the Game Server Orchestrator.

[Architecture diagram] Components: Game Client (PC / Console / Mobile); CDN (patches / assets); Auth Service (JWT / OAuth); Matchmaker (skill bucketing); Profile Service (stats / rank); Social Service (squads / presence); Orchestrator (K8s + Agones); Game Server Fleet (~12K concurrent instances, UDP, 20–60 Hz tick loop); PostgreSQL (profiles / stats); Redis (cache / presence); Kafka (match results); Cassandra (match history); Anti-Cheat (ML pipeline / replay); S3 (replays). Edge labels: HTTPS, Assets, Assign, UDP (gameplay), Results, Replay data.

Regional Architecture

Each of 5–6 major regions is nearly self-contained: its own matchmaker, game server fleet, orchestrator, database replicas, and Kafka cluster. Players are matched within their region for latency. Only account/auth, shop/payments, and global leaderboards are truly global services synced asynchronously across regions.

Game Server Orchestrator

The most distinctive component, with no real counterpart in typical web systems. It maintains a warm pool of pre-provisioned game server containers. When the matchmaker fills a lobby, it requests a server from the pool (< 2 seconds), avoiding cold-start delays. Pre-warming uses predictive scaling based on time-of-day and historical demand. At peak, the orchestrator recycles ~480 servers per minute.

Request Flow — Step Through
Participants: Client, Auth Service, Matchmaker, Orchestrator, Game Server, Tick Loop (×600), Kafka, Match Processor
06

Deep Dive — Real-Time Game Networking

Everything else in this design is standard distributed systems. The hard, novel problem is keeping 100 players in sync on an 8km map with sub-100ms perceived latency over unreliable networks. Four interlocking techniques make this work.

The fundamental problem: The server and every client are always looking at different moments in time because of network latency. A player in Mumbai is ~30ms from the server; add interpolation delay and they see the world ~80ms in the past. The netcode's job is to create the illusion of a shared present moment.

Client

1. Client-Side Prediction

Your inputs are applied locally and instantly — the client runs the same physics engine as the server. You don't wait for the server to confirm movement. The client "predicts" the result. This is why your own movement always feels responsive.

Client

2. Server Reconciliation

When the server's authoritative state arrives, the client checks if its predictions were correct. If not, it re-simulates all unconfirmed inputs on top of the server's position. Small errors are blended smoothly; large errors cause a visible "rubber-band" snap.

Client

3. Entity Interpolation

Other players' positions arrive at 20 Hz but must render at 60–144 FPS. The client renders them one tick behind (~50ms in the past), smoothly interpolating between the two most recent snapshots. This is why enemies look smooth even at low tick rates.

Server

4. Lag Compensation

When you fire, you're aiming at a target ~80ms in the past. The server rewinds hitboxes to the time you saw the world and checks your shot there. This makes shooting feel accurate, but means targets can occasionally die "behind cover."
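Entity interpolation (technique 3 above) is the simplest of the four to show in code. The sketch below is a minimal version assuming each remote player's snapshots arrive as `(server_time, position)` pairs sorted by time; the function names and the tuple layout are illustrative, and the 50ms delay is the one-tick figure from the text.

```python
from bisect import bisect_right

INTERP_DELAY_S = 0.050  # render remote players one 20 Hz tick in the past


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_position(snapshots, render_time):
    """Position of a remote entity at render_time.

    snapshots: list of (server_time, (x, y, z)), sorted by server_time.
    Clamps to the oldest or newest snapshot when render_time falls outside
    the buffered range (no extrapolation in this sketch).
    """
    times = [t for t, _ in snapshots]
    i = bisect_right(times, render_time)
    if i == 0:
        return snapshots[0][1]
    if i == len(snapshots):
        return snapshots[-1][1]
    (t0, p0), (t1, p1) = snapshots[i - 1], snapshots[i]
    alpha = (render_time - t0) / (t1 - t0)
    return tuple(lerp(a, b, alpha) for a, b in zip(p0, p1))


def remote_position_now(snapshots, client_time):
    """What the renderer draws: the world as it was INTERP_DELAY_S ago."""
    return interpolate_position(snapshots, client_time - INTERP_DELAY_S)
```

Rendering in the past is what guarantees two snapshots are usually available to blend between; the cost is the extra ~50ms that lag compensation later has to account for.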

Client-Side Prediction — How It Works

Without prediction, pressing W results in 83ms of nothing (30ms to server + 3ms processing + 50ms back), then a sudden teleport. At 60 FPS, that's 5 frames of your character ignoring you. Unplayable.

With prediction: the client immediately applies your input to a local physics simulation — the same code the server runs. It stores each predicted result in a prediction buffer (ring buffer, ~128 entries = 2 seconds of inputs). When the server confirms input #N, the client discards entries up to N and re-simulates everything after N on top of the server's authoritative position.

The re-simulation typically processes 4–14 frames and takes < 0.1ms — invisible in the frame budget. When prediction matches (99% of the time), the player sees nothing. When it doesn't, the correction is blended over 50–200ms depending on error magnitude.
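The predict-then-reconcile loop described above can be sketched in a few lines. This is a simplified 2D version under stated assumptions: the movement function, the speed constant, and the class name `PredictedClient` are hypothetical, and the buffer stores only the unconfirmed inputs (not the predicted results), which is enough to replay them.

```python
from collections import deque

DT = 1 / 60   # assumed client input rate
SPEED = 5.0   # hypothetical movement speed in m/s


def apply_input(pos, move):
    """Shared movement step; client and server must run identical code."""
    return (pos[0] + move[0] * SPEED * DT, pos[1] + move[1] * SPEED * DT)


class PredictedClient:
    def __init__(self):
        self.pos = (0.0, 0.0)
        self.seq = 0
        # Ring buffer of (seq, move) the server has not confirmed yet.
        self.pending = deque(maxlen=128)

    def local_input(self, move):
        """Apply the input immediately and remember it for later replay."""
        self.seq += 1
        self.pending.append((self.seq, move))
        self.pos = apply_input(self.pos, move)
        return self.seq

    def on_server_state(self, acked_seq, server_pos):
        """Drop confirmed inputs, then replay the rest on the server's position."""
        while self.pending and self.pending[0][0] <= acked_seq:
            self.pending.popleft()
        pos = server_pos
        for _, move in self.pending:
            pos = apply_input(pos, move)
        self.pos = pos
```

A production client would blend from the old predicted position to the replayed one instead of assigning it directly, which is where the 50–200ms correction window comes in.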

Lag Compensation — The Shot Sequence

sequenceDiagram
    participant A as Player A (Shooter)
    participant S as Game Server
    participant B as Player B (Target)
    Note over A: Sees B at position<br/>from 80ms ago
    A->>A: Click fire, muzzle flash<br/>plays instantly (prediction)
    A->>S: PLAYER_INPUT {fire, yaw, pitch, seq:47}
    Note over S: Receives after ~30ms
    S->>S: Calculate A's view time<br/>T - 30ms - 50ms = T-80ms
    S->>S: Rewind B's hitbox to T-80ms
    S->>S: Raycast → HIT on B's torso
    S->>S: Apply 38 HP damage (AKM)
    S-->>A: HIT_CONFIRM {damage:38, zone:torso}
    S-->>B: DAMAGE_EVENT {from:A, dmg:38, dir:NW}
    Note over A: ~63ms total, sees hit marker
    Note over B: ~63ms total, takes damage,<br/>screen shakes

The server maintains a position history buffer for every player (last 1 second, 20 entries × ~40 bytes ≈ 800 bytes per player, or ~80 KB total for 100 players). On a shot, it interpolates between the two history entries bracketing the shooter's view time, reconstructs hitboxes, and checks the ray. The rewind is capped at 250ms to prevent abuse from players intentionally adding latency.
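A rough sketch of the rewind step follows, assuming one history buffer per player and a single sphere standing in for the full hitbox set. The class and function names are illustrative, and the sphere test is a deliberate simplification of real per-bone hitboxes.

```python
from collections import deque
import math

MAX_REWIND_S = 0.250     # rewind cap from the text
INTERP_DELAY_S = 0.050   # one 20 Hz tick of client-side interpolation delay


class PositionHistory:
    """Last ~1 second of positions for one player (20 entries at 20 Hz)."""

    def __init__(self, capacity=20):
        self.samples = deque(maxlen=capacity)  # (time, (x, y, z)), oldest first

    def record(self, t, pos):
        self.samples.append((t, pos))

    def position_at(self, t):
        """Interpolate between the two entries bracketing time t."""
        samples = list(self.samples)
        if t <= samples[0][0]:
            return samples[0][1]
        for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
            if t0 <= t <= t1:
                a = (t - t0) / (t1 - t0)
                return tuple(u + (v - u) * a for u, v in zip(p0, p1))
        return samples[-1][1]


def shooter_view_time(server_now, one_way_latency_s):
    """Time the shooter was looking at, with the rewind capped."""
    return server_now - min(one_way_latency_s + INTERP_DELAY_S, MAX_REWIND_S)


def ray_hits_sphere(origin, direction, center, radius):
    """True if a ray from origin along direction passes within radius of center."""
    length = math.sqrt(sum(d * d for d in direction))
    unit = [d / length for d in direction]
    to_center = [c - o for c, o in zip(center, origin)]
    along = sum(a * b for a, b in zip(to_center, unit))
    if along < 0:
        return False  # target is behind the shooter
    closest_sq = sum(c * c for c in to_center) - along * along
    return closest_sq <= radius * radius
```

With a 30ms one-way latency the server tests the shot against the target's position 80ms ago; with 500ms it only rewinds 250ms, so a player adding artificial lag gains nothing beyond the cap.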

The Peeker's Advantage — An Unavoidable Tradeoff

Because of lag compensation, the player who peeks a corner has a ~80ms advantage over the defender. The peeker sees the defender immediately, but the defender sees the peeker ~80ms late. This is a fundamental consequence of physics (speed of light + network latency) — there is no solution that eliminates it without breaking hit registration for the shooter. Every competitive FPS makes this same tradeoff.

The Tick Budget

At 20 Hz, the server has 50ms per tick to process everything:

Phase | Budget | What Happens
Network receive | ~2ms | Read all queued UDP packets from 100 clients
Input processing | ~3ms | Apply 100 players' inputs to world state
Physics simulation | ~8ms | Movement, collision, vehicles, projectiles
Hit detection | ~5ms | Lag-compensated raycasts for active shooters
Game logic | ~4ms | Zone damage, loot, airdrops, kills/knocks
Anti-cheat | ~3ms | Speed checks, fire rate, position sanity
State broadcast | ~10ms | Per-player AOI filter + delta compress + send
Headroom | ~15ms | Absorbs late-game spikes (30 players in small circle)

Area of Interest (AOI) — Spatial Grid

The 8km × 8km map is divided into 500m × 500m cells (16 × 16 = 256 cells). Each player belongs to one cell. Their AOI = their cell + 8 neighbors (3×3 grid). Only entities within these 9 cells are included in their state update — cutting broadcast from 100 to ~20 players. Cell lookup is O(1) by hashing position to cell index.
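The grid lookup can be sketched as below. The cell size and grid dimensions come from the text; the class name `SpatialGrid`, its methods, and the choice to clamp out-of-bounds positions to the edge cells are assumptions made for illustration.

```python
from collections import defaultdict

CELL_SIZE_M = 500.0   # 500m x 500m cells
GRID_DIM = 16         # 8km / 500m = 16 cells per side


def cell_of(x, y):
    """O(1) mapping from a world position (meters) to a cell index."""
    clamp = lambda v: min(GRID_DIM - 1, max(0, int(v // CELL_SIZE_M)))
    return clamp(x), clamp(y)


class SpatialGrid:
    def __init__(self):
        self.cells = defaultdict(set)   # (cx, cy) -> entity ids in that cell
        self.cell_by_entity = {}        # entity id -> (cx, cy)

    def update(self, entity_id, x, y):
        """Move an entity, touching the cell sets only when it changes cell."""
        new_cell = cell_of(x, y)
        old_cell = self.cell_by_entity.get(entity_id)
        if old_cell == new_cell:
            return
        if old_cell is not None:
            self.cells[old_cell].discard(entity_id)
        self.cells[new_cell].add(entity_id)
        self.cell_by_entity[entity_id] = new_cell

    def area_of_interest(self, entity_id):
        """Entities in the 3x3 block of cells around the given entity."""
        cx, cy = self.cell_by_entity[entity_id]
        nearby = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nearby |= self.cells.get((cx + dx, cy + dy), set())
        nearby.discard(entity_id)
        return nearby
```

The state broadcast then iterates `area_of_interest(player)` instead of all 100 players, which is where the reduction to ~20 entities per update comes from.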

Delta Compression — 60–80% Bandwidth Savings

Instead of sending full state every tick, the server tracks what each client has ACK'd and sends only changed fields. A field bitmask (1 byte) indicates which fields follow. If only position changed: 7 bytes instead of 19 bytes per entity (~63% smaller). If nothing changed: 1 byte (~95% smaller). Overall savings depend on how many entities are idle on a given tick and typically fall in the 60–80% range.
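Here is a minimal sketch of the bitmask encoding. The field list and widths are assumptions picked so that a full entity costs 19 bytes and a position-only update costs 7, matching the figures above; a real protocol would carry more fields (weapon, stance) and quantize differently.

```python
import struct

# Hypothetical per-entity fields: (name, struct format, mask bit).
# Sizes: pos 6 + rot 4 + vel 6 + anim 1 + health 1 = 18 bytes, plus 1 mask byte.
FIELDS = [
    ("pos", "<3H", 0x01),     # quantized x, y, z
    ("rot", "<2H", 0x02),     # quantized yaw, pitch
    ("vel", "<3h", 0x04),     # quantized velocity
    ("anim", "<B", 0x08),     # animation state id
    ("health", "<B", 0x10),   # 0-100
]


def encode_delta(prev, curr):
    """Encode only the fields that differ from the last state the client ACK'd.

    prev is None when the client has no baseline (full state is sent).
    Field values are tuples, e.g. {"pos": (1, 2, 3), "health": (100,)}.
    """
    mask = 0
    body = b""
    for name, fmt, bit in FIELDS:
        if prev is None or prev[name] != curr[name]:
            mask |= bit
            body += struct.pack(fmt, *curr[name])
    return bytes([mask]) + body


def decode_delta(prev, data):
    """Apply a delta on top of the baseline state and return the new state."""
    state = dict(prev) if prev else {}
    mask = data[0]
    offset = 1
    for name, fmt, bit in FIELDS:
        if mask & bit:
            state[name] = struct.unpack_from(fmt, data, offset)
            offset += struct.calcsize(fmt)
    return state
```

The scheme only works if the baseline is a state the client has acknowledged; encoding against an unacknowledged tick would corrupt the client's view after a single lost packet.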

07

Key Design Decisions & Tradeoffs

1. Server-Authoritative vs Client-Authoritative

✓ Chosen

Server-Authoritative

Server runs the simulation, validates all inputs. Client sends inputs, server decides outcomes. Anti-cheat is architecturally possible. Costs 1,500–3,000 game servers at peak.

✗ Alternative

Client-Authoritative

Client tells server "I'm at X, I hit Y for Z damage." Zero validation possible. Speed hacks, aimbots, teleportation are trivially easy and undetectable. No competitive game uses this.

2. Tick Rate — 20 Hz vs 60 Hz vs 128 Hz

✓ Chosen

Adaptive 20→60 Hz

Start at 20 Hz with 100 players (50ms budget). Ramp to 30–60 Hz as players die. Balances server cost and mobile bandwidth at scale. 100 players at 60 Hz exceeds tick budget on commodity hardware.

✗ Alternative

Fixed 128 Hz

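Common in competitive 5v5 shooters (Valorant runs 128 Hz servers; CS2 uses 64 Hz with sub-tick input timestamps). Only viable for small player counts: 100 players at 128 Hz leaves a 7.8ms tick budget, which is impractical on commodity hardware. Bandwidth: ~60 KB/s per player, which excludes mobile entirely.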
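The adaptive option can be as simple as a lookup on the number of surviving players. In the sketch below the player-count thresholds (50 and 20) are hypothetical placeholders; the text only says the rate ramps from 20 to 30 to 60 Hz as players die, so real cutoffs would come from profiling the tick budget.

```python
def tick_rate_hz(alive_players):
    """Pick a server tick rate from the number of players still alive.

    Thresholds are illustrative assumptions, not values from the design.
    """
    if alive_players > 50:
        return 20
    if alive_players > 20:
        return 30
    return 60


def tick_budget_ms(alive_players):
    """Time available per tick at the chosen rate."""
    return 1000 / tick_rate_hz(alive_players)
```

Clients need to be told when the rate changes so their interpolation delay can shrink with it.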

3. UDP vs TCP for Game Traffic

✓ Chosen

Custom UDP Protocol

No head-of-line blocking. Lost position packets are fine (next tick corrects). Full control over reliability, compression, and batching. Must build reliability layer manually.

✗ Alternative

TCP / QUIC

TCP: head-of-line blocking causes 100–300ms stalls on every dropped packet (1–5% of frames on WiFi). Completely unplayable. QUIC: promising but unreliable datagrams extension not yet mature for game use.

4. Lag Compensation — Favor Shooter vs Favor Target

✓ Chosen

Favor Shooter (rewind, 250ms cap)

Shots hit where you aimed. Game is playable worldwide up to ~200ms ping. Tradeoff: victims occasionally die behind cover due to peeker's advantage. Every major FPS (CS2, Overwatch, Apex) makes this same choice.

✗ Alternative

Favor Target (no rewind)

If behind cover, you're safe. But high-ping players can never hit moving targets — must lead shots by their own latency. Unplayable above ~50ms. Excludes most of the world's players.

5. Game Server Lifecycle — Containers vs Bare Metal

✓ Chosen

Hybrid (Bare Metal + Cloud Containers)

Bare metal for baseline capacity (~40% of peak, cheapest per-match cost). Kubernetes + Agones containers on cloud for elasticity (peaks, events, new regions). Fast startup: 3–8 seconds for pre-pulled containers.

✗ Alternative

VMs Only

Slow startup (30–90s), 15–20% overhead from hypervisor, coarse granularity. Works for older infrastructure but wastes resources and can't handle demand spikes quickly enough.

6. AOI — Fixed Spatial Grid vs Quadtree

✓ Chosen

Fixed 500m × 500m Grid

O(1) cell lookup by position hash. 256 cells total, 3×3 neighbor query for AOI. Simple, cache-friendly, predictable performance. At 100 players on 8km map, density is low enough that fixed cells work perfectly.

✗ Alternative

Quadtree / Adaptive

O(log N) operations, must rebalance every tick as players move. More complex, cache-unfriendly. Shines when density varies by 100× (MMOs). Overkill for 100 uniformly-distributed players.

7. Anti-Cheat — Kernel-Level vs Server-Only

✓ Chosen

Three-Layer Defense

Kernel driver (EasyAntiCheat/BattlEye) catches memory hacks. Server-side validates physics constraints. Post-match ML detects statistical anomalies. Combined: ~85% cheat detection rate.

✗ Alternative

Server-Only Validation

No invasive client software. But cannot detect wallhacks (client renders what server sends), and "human-plausible" aimbots slip through. Only catches ~40% of cheats. Game reputation suffers.

08

What Can Go Wrong

🔥 Game Server Crash Mid-Match

Server process crashes at minute 18 — 100 players lose their match. Causes: memory leaks, unhandled edge cases (vehicle + zone boundary + revival simultaneously). Mitigation: Checkpoint state to Redis every 30 sec for crash recovery (complex, rarely implemented). Practically: invest in crash prevention — canary deploys, watchdog processes, memory thresholds that force graceful early match end. Acceptable rate: < 0.05% of matches.

⏳ Matchmaking Starvation at Low Population

4 AM in a small region — only 12 players in the platinum squad TPP queue. Queue time exceeds 5 minutes. Players leave → fewer players → longer queues → death spiral. Mitigation: Progressive bracket widening (30s: merge skill tiers, 60s: merge perspectives, 90s: fill with bots, 120s: start with 60–80 players). Cross-region matching as last resort (120–150ms ping but playable).
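The widening schedule maps directly onto a function of queue wait time. A minimal sketch, with a made-up function name and result keys; cross-region matching is left out because the text gives no time threshold for it.

```python
def widening_rules(wait_sec):
    """Matchmaking relaxations that apply after a ticket has waited wait_sec."""
    return {
        "merge_skill_tiers": wait_sec >= 30,
        "merge_perspectives": wait_sec >= 60,
        "fill_with_bots": wait_sec >= 90,
        # After 120s the match may start short-handed (60-80 players).
        "min_players_to_start": 60 if wait_sec >= 120 else 100,
    }
```

Because each relaxation is monotonic in wait time, the matchmaker can evaluate the oldest ticket in a queue and apply its rules to the whole lobby being assembled.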

🔄 Desync — Client and Server Disagree

Player sees themselves behind a rock; server thinks they're exposed. Causes: floating-point divergence between client and server physics, packet loss bursts, clock drift, physics step mismatch (client at 144 FPS vs server at 20 Hz). Mitigation: Quantize positions to 1cm grid, force full state sync every 5 seconds, use fixed physics timestep on both sides, NTP-like clock synchronization.
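Two of these mitigations, the fixed physics timestep and 1cm quantization, fit in a short sketch. The 20ms step size and the names `FixedStepClock` and `quantize_cm` are assumptions for illustration; time is tracked in integer microseconds so that a fast and a slow render loop run exactly the same number of physics steps.

```python
FIXED_STEP_US = 20_000   # hypothetical physics step (20ms), in integer microseconds


def quantize_cm(meters):
    """Snap a coordinate to the 1cm grid so both sides store identical values."""
    return round(meters * 100) / 100


class FixedStepClock:
    """Runs step_fn at a fixed timestep regardless of render frame rate."""

    def __init__(self, step_fn):
        self.step_fn = step_fn
        self.accumulator_us = 0
        self.steps = 0

    def advance(self, frame_time_us):
        """Add one rendered frame's duration and run any physics steps now due."""
        self.accumulator_us += frame_time_us
        while self.accumulator_us >= FIXED_STEP_US:
            self.step_fn(FIXED_STEP_US / 1_000_000)
            self.accumulator_us -= FIXED_STEP_US
            self.steps += 1
```

A client rendering at 250 FPS and one at 20 FPS both execute 50 identical physics steps per second here, so their simulated positions stay bit-for-bit equal given the same inputs.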

🛡️ DDoS on Game Servers

Attacker discovers game server IPs (exposed via UDP connection handshake) and floods a specific match — especially during esports tournaments. Mitigation: Route through UDP relay/proxy (game server IP never exposed, +1–2ms latency), cloud DDoS protection (AWS Shield, Cloudflare Spectrum), kernel-level packet filtering (eBPF/XDP) dropping packets from non-player IPs.

📊 Database Hot Spots — Match-End Storm

When a match ends, 100 player stats must be updated simultaneously. Multiple matches ending together = thousands of writes in a burst. Mitigation: Game servers emit results to Kafka (fire-and-forget, < 1ms). Consumers process at steady rate. Batch all 100 updates into a single SQL transaction. Stat aggregation runs async via Flink, delayed 30–60 seconds.

🎯 Cheater Ruins Competitive Match

Aimbot user climbs to top ranks, ruining every match. Affects 99 players per match directly, destroys game reputation long-term. Mitigation: Three-layer anti-cheat (kernel + server + ML). Phone verification for ranked. Hardware ID bans. When a cheater is banned, retroactively void their last N matches and recalculate affected players' ranks.

09

Interview Tips

💡
Frame the problem as "two systems, not one."
The platform layer (matchmaking, profiles, inventory) is standard microservices — sketch it quickly. The game server layer (real-time UDP simulation) is where the novel engineering is. Say: "I'll spend 70% of our time on the game server because that's where the unique challenges are."
Let numbers drive architecture.
Don't say "we need lots of game servers." Say: "At 1.5M CCU with 100 players per match, we have 12,000 concurrent matches. Each needs 2–4 cores. That's 1,500–3,000 servers, recycling at 480/min — which tells us we need a fast orchestration layer and a warm pool."
🎯
The networking deep dive is your superpower.
Most candidates say "WebSockets" and stop. Explaining any one of the four netcode techniques (prediction, reconciliation, interpolation, lag compensation) in depth shows real-time systems understanding that most candidates can't match.
🔑
Show the scaling asymmetry.
"Unlike web services where we add stateless instances behind a load balancer, each game server is a stateful island that lives for one match. We can't move a player between servers or merge half-empty matches. This means we need a custom scheduler for ephemeral game instances."
🧠
Have a clear answer for "What's the hardest part?"
"Making shooting feel fair across varying network conditions. Do I check the shot against where the target IS now, or where they WERE when the shooter saw them? There's no perfect answer — it's a fundamental consequence of the speed of light. The engineering is building a rewind system with the right cap to feel as fair as possible for both players."
⚠️
Flag what's different from web systems.
Protocol: UDP, not HTTP. State: ephemeral in-memory, not persistent. Consistency: strong within a match (server-authoritative), eventual for platform services. Failure mode: match is disposable — invest in crash prevention, not crash recovery.
11

Evolution

How this design grows from MVP to planet-scale.

1

MVP — Single Region, 50 Players

One region, 50-player matches, 10 Hz tick rate, single map, solo mode only, no ranked matchmaking, server-side anti-cheat only, no replay system. Game servers run as plain processes on 10–20 EC2 instances with manual scaling. Capacity: ~5,000 CCU, ~100 concurrent matches. Team: 3–5 engineers. Goal: validate the core gameplay loop — can players connect, move, shoot, loot, die, and win?

2

Launch-Ready — Multi-Region, 100 Players

Scale to 100 players, 20 Hz adaptive tick rate (20→30→60 as players die), 3 regions, squad modes, container-based servers with custom orchestrator and warm pools, Kafka for match results, Redis caching, client-side anti-cheat (EasyAntiCheat), basic cosmetics. Full netcode stack: prediction, reconciliation, interpolation, lag compensation with 200ms cap, delta compression, AOI grid. Capacity: ~200K CCU, ~2,000 concurrent matches. Team: 15–25 engineers.

3

Scale — Millions of Players

5–6 fully independent regions, Kubernetes + Agones orchestration, bot backfill for low-pop queues, ranked seasons with Elo ratings, full cosmetics economy with battle pass, replay system (5% stored), post-match ML anti-cheat, global leaderboards, spectator mode for esports, cross-platform play (PC + console + mobile with per-platform optimizations). Hybrid infra: bare metal baseline + cloud burst. Capacity: ~1.5M CCU, ~12,000 concurrent matches. Team: 50–100 engineers.

4

Planet-Scale — Live Service

Custom bare-metal fleet in 10+ regions with edge compute in ISP PoPs. ML-driven matchmaking (play style, toxicity, connection quality). A/B testing framework for game mechanics. Advanced anti-cheat: behavioral fingerprinting, hardware attestation, legal takedowns. User-generated content (custom modes, map editor). Streaming integration (Twitch drops, anti-stream-sniping). 200+ engineers across multiple studios. The hard problems become organizational (modular codebase for 200 devs), operational (rolling deploys with 2–3 server versions live), and financial ($3–5M/month infrastructure, so every 10% efficiency gain saves $300–500K per month).
