Concepts

The vocabulary of system design.

Deep, interview-grade references for the concepts that show up in every system design discussion. Each concept explains the intuition first, then the mechanics, then the tradeoffs, then the deep dive, and is cross-linked to the problem pages that use it.

110 / 110 concepts shipped · 15 categories

Foundations

9 / 9 shipped
CONCEPT~4 min

Interview Framework

You get 45 minutes. The interviewer says "design Twitter." There are a thousand things to say and you'll remember half of them. Without structure, you'll ramble about load balancers for 20 minutes, never estimate scale,

CONCEPT~3 min

Back-of-Envelope Estimation

Every design decision hinges on numbers. Do we need sharding? Only if writes exceed ~10k/sec. Do we need a cache? Only if read latency is a bottleneck. Do we need a CDN? Only if users are geographically far from origin.

CONCEPT~3 min

Latency Numbers

Every architectural decision — "do we need a cache?", "should this call be async?", "can we fit this in RAM?" — collapses to the same question: how long does the operation actually take? Engineers who can't estimate this

CONCEPT~3 min

Availability — The Nines

"Four nines" (99.99%) sounds marginally better than "three nines" (99.9%). The truth: four nines allows 52 minutes of downtime per year; three nines allows 8.8 hours. That's a 10× difference in how much engineering you p

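The downtime figures quoted for three and four nines can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year; the function name `downtime_minutes_per_year` is illustrative and does not come from the page.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for an availability target in percent."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


three_nines = downtime_minutes_per_year(99.9)   # about 525.6 minutes, roughly 8.8 hours
four_nines = downtime_minutes_per_year(99.99)   # about 52.6 minutes
```

Each extra nine divides the downtime budget by ten, which is where the 10× gap between the two targets comes from.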
CONCEPT~3 min

SLOs, SLIs, SLAs

"Our service is reliable" is not a claim; it's marketing. "Our service is reliable" with numbers — "99.9% of requests complete under 200ms, measured over a 30-day window" — is an engineering target. Without SLOs/SLIs/SLA

CONCEPT~3 min

Time Synchronization & Clocks

Two machines, separated by a network, disagree on the order of events. "Event A at 12:00:00.500" on server 1. "Event B at 12:00:00.400" on server 2. Did B come before A? Only if the clocks agree, and they never perfectly

CONCEPT~3 min

Concurrency Models

"Handle 50k concurrent connections" is an easy requirement to write and a hard one to meet. Every language offers a concurrency model — threads, event loops, actors, coroutines, or some combo. Each has different costs: t

CONCEPT~3 min

Shared-Nothing Architecture

The single most important architectural principle for scalable systems: no resource is shared across nodes. No shared memory, no shared disk, no shared lock manager, no shared cache. Each node owns its slice of data and

CONCEPT~3 min

Multi-Tenancy

You're building B2B SaaS. 10,000 customer companies, each with their own users + data + settings. Should each customer get their own database, their own EC2 instances, their own Kubernetes namespace? Or do they all share

Networking & Delivery

13 / 13 shipped
CONCEPT~4 min

DNS

Every request to your API starts with "what IP is api.example.com?" That's DNS. Get it wrong and users can't reach you. Configure TTLs wrong and users are stuck on a dead IP for hours after you change it. Pick the wrong

CONCEPT~4 min

CDN

Your origin server is in Virginia. Your user is in Singapore. A round-trip over the Pacific is ~180ms, minimum, without TLS handshake, TCP slow-start, or any actual work. Serve a page with 40 assets (images, scripts, fon

CONCEPT~3 min

TCP vs UDP

"TCP is reliable, UDP is fast" is the bumper sticker. True, but the interesting question is why you'd ever choose UDP. For decades the answer was niche (DNS, streaming). Now — with QUIC, WebRTC, and modern games — UDP is

CONCEPT~3 min

HTTP/1 vs HTTP/2 vs HTTP/3

Picking HTTP/2 over HTTP/1 gave an instant 2× speedup on a typical page load with no app code changes. HTTP/3 fixes the one failure mode HTTP/2 still had. In interviews: naming the specific wins of each version shows you

CONCEPT~3 min

TLS & HTTPS

Every byte between a client and your server passes through routers you don't control. Without encryption, anyone on the path reads passwords, session tokens, the works. TLS encrypts it all — but also adds latency (the ha

CONCEPT~3 min

Proxy vs Reverse Proxy

Both types of proxy stand between a client and a server. Which one faces which way flips the use case entirely. Candidates mix them up all the time. Interviewers notice.

CONCEPT~3 min

WebSockets vs SSE vs Polling

Chat. Live scores. Notifications. Collaborative editing. Any feature where the server pushes data to the client the moment it's ready. HTTP was designed for the opposite: client asks, server answers. To make push happen,

CONCEPT~3 min

REST vs GraphQL vs gRPC

Picking an API protocol shapes your client experience, your throughput, your tooling, and your caching story for years. "We'll use REST" is the lazy default, often the wrong one. gRPC is 7× faster but kills browser usabi

CONCEPT~4 min

Service Mesh

You have 50 microservices in production. Every service-to-service call needs: TLS encryption, mTLS auth, retries, timeouts, circuit breaking, observability, traffic routing. Implementing all of that in every service, in

CONCEPT~4 min

Webhooks

Stripe processes a payment. Your app needs to know when it succeeds. Two options:

CONCEPT~3 min

API Versioning

You ship v1 of your API. Customers integrate. Six months later, you need to rename a field, change a response shape, drop a deprecated endpoint. Every existing customer integration will break. They didn't sign up for a m

CONCEPT~3 min

Edge Computing

Your origin server is in Virginia. A user in Singapore: 180ms round-trip just to reach you, before any work. CDNs solved this for static content. Edge computing extends the same idea to code — run your business logic at

CONCEPT~3 min

Compression & Encoding

Sending 100 KB to a user costs 100ms over an 8 Mbps link. Compress it to 20 KB → 20ms. Compression is the cheapest performance win in your stack: zero infrastructure changes, ~80% of the bytes saved, milliseconds of CPU.
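The transfer-time numbers above follow from size divided by bandwidth. A rough sketch, assuming decimal units (1 KB = 1000 bytes, 1 Mbps = 1,000,000 bits/sec) and ignoring latency, TCP slow-start, and protocol overhead; `transfer_ms` is a hypothetical helper name.

```python
def transfer_ms(size_kb: float, link_mbps: float) -> float:
    """Milliseconds needed to push size_kb kilobytes through a link_mbps link."""
    bits = size_kb * 1000 * 8                  # kilobytes -> bits
    seconds = bits / (link_mbps * 1_000_000)   # bits / (bits per second)
    return seconds * 1000


uncompressed_ms = transfer_ms(100, 8)  # about 100 ms
compressed_ms = transfer_ms(20, 8)     # about 20 ms
```

On a real connection the round-trip time is added on top of these figures, so the relative saving is smaller for tiny payloads.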

Scaling

5 / 5 shipped

Databases

15 / 15 shipped
CONCEPT~4 min

SQL vs NoSQL

Picking the wrong database is the most expensive architectural mistake you can make early. You can change a cache vendor in a weekend; you can't change your database in a quarter. Interviewers want to see a justified cho

CONCEPT~3 min

ACID vs BASE

"Strong consistency" and "eventual consistency" are the two ends of a spectrum that drives every data-layer decision. Pick strong when you must (payments, inventory, anything involving money or exclusivity). Pick eventua

CONCEPT~4 min

CAP Theorem

Every distributed database picks one of two behaviors when the network between nodes breaks: return stale data or refuse to serve the request. CAP is the formal statement that you cannot avoid this choice. Pretending oth

CONCEPT~3 min

PACELC Theorem

CAP only describes what happens during a partition — but partitions are rare. Most of your system's life is spent in the no-partition state, where CAP has nothing to say. The real everyday tradeoff there is between laten

CONCEPT~3 min

Replication

One database server holds your data. The disk dies. You lose everything. Replication — keeping copies on multiple servers — solves three problems at once: durability (survive hardware failures), availability (serve reads

CONCEPT~4 min

Sharding

One server, even a massive one, tops out around 100k writes/sec. Replication buys you read scale but not write scale; the leader is still a bottleneck. Once writes exceed one box, you shard: split the data across N box

CONCEPT~3 min

Indexing

A database table with 100M rows. A query: WHERE email = 'x@y.com'. Without an index, the DB scans all 100M rows — seconds per query. With a B-tree index, it's ~10 lookups, 0.5ms total. That's a 10,000× difference and the

CONCEPT~4 min

Normalization vs Denormalization

Your data model has two extremes. Fully normalized: every fact lives in exactly one place, connected by foreign keys. Great for writes (update once, seen everywhere), bad for reads (join 5 tables to render one page). Ful

CONCEPT~4 min

Database Types

"We'll use a database" is not an architecture decision — it's punting. Every system design interview eventually asks: Postgres or Cassandra? Redis or DynamoDB? Mongo or Elasticsearch? The answer is never "it depends" alo

CONCEPT~4 min

Distributed Transactions

User clicks "Buy." You must: charge card (payments service) + decrement inventory (warehouse service) + create order (orders DB). All three or none — partial is catastrophic (card charged, no order). In one database, ACI

CONCEPT~3 min

Write-Ahead Log (WAL)

You commit a transaction. Two milliseconds later, the power fails. On reboot, is the commit there? If yes, your DB is durable. If no, it's a lie. The trick every durable datastore uses — Postgres, MySQL, Cassandra, SQLit

CONCEPT~3 min

Connection Pooling

Opening a TCP connection takes 1 round-trip. TLS adds another. The Postgres handshake adds 2-3 more. Each connection: ~5-10ms of pure overhead before any query runs. If your app opens a new connection per reque

CONCEPT~3 min

Database Federation

One huge database holds everything: users, orders, products, reviews, notifications, sessions. As the company grows, this DB becomes the bottleneck — every team's slow query affects everyone, schema changes touch everyon

CONCEPT~4 min

Change Data Capture (CDC)

Your transactional DB is the source of truth for orders. You also need: a search index in Elasticsearch, a denormalized read store in Redis, an analytics warehouse, a notifications pipeline triggered on order events. Fi

CONCEPT~3 min

Data Lake vs Warehouse

"Where do analytics queries run?" If you say "Postgres," you've never had real analytics needs. Running aggregations over 10 TB of order history on your transactional DB tanks every API. Analytics belongs in a separate s

Caching

4 / 4 shipped

Messaging & Async

5 / 5 shipped

Distributed Systems

13 / 13 shipped
CONCEPT~5 min

Consensus — Paxos & Raft

Five nodes each think they might be the leader. The old leader's network cable was unplugged; now it's plugged back in. Who's in charge? Who accepts writes? If two nodes both think they're leader, they both accept writes

CONCEPT~5 min

Distributed Locking

Two instances of your service want to send a reminder email for order #42. Both run the cron job at the same second. Without coordination, the user gets two emails. Or two payment workers process the same refund. Or two

CONCEPT~4 min

Leader Election

Distributed systems are full of "only one at a time" jobs: one cron runner, one database primary, one billing reconciler, one job scheduler. Without coordination, every replica does the same work — duplicate emails, dupl

CONCEPT~4 min

Consistent Hashing

You have 10 cache servers. You route keys to them with hash(key) mod 10. It works beautifully. Then you add an 11th server. Now ~90% of your keys hash to a different server — your entire cache invalidates. Origin gets ha
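The "~90% of keys move" figure for modulo placement can be reproduced empirically. The sketch below is illustrative only: the key names are made up, and SHA-256 stands in for whatever hash function a real router would use.

```python
import hashlib


def bucket(key: str, n_servers: int) -> int:
    """Naive placement: a stable hash of the key, modulo the server count."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


keys = [f"user:{i}" for i in range(100_000)]  # hypothetical key names

# Count keys whose server changes when the pool grows from 10 to 11 servers.
moved = sum(1 for k in keys if bucket(k, 10) != bucket(k, 11))
moved_fraction = moved / len(keys)  # expected to land near 10/11, about 0.91
```

A key stays put only when its hash has the same remainder modulo 10 and modulo 11, which holds for about 1 key in 11. That is why nearly the whole cache is invalidated by adding one server under this scheme.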

CONCEPT~4 min

Vector Clocks & LWW

Two users edit the same document at (roughly) the same moment. Their clients both commit locally. Now the server has two versions. Which one wins?

CONCEPT~4 min

Gossip Protocols

1000 nodes in a Cassandra cluster. Each needs to know about the others — who's alive, what ranges they own, their load. Centralized approach (one server polls all 1000) creates a SPOF and a bottleneck. Broadcasting (each

CONCEPT~3 min

Quorum

In a replicated system, you have N copies of the data. Do you write to all N before returning? Just 1? Somewhere in between? Do you read from all N (slow), or 1 (possibly stale)? The answer is a quorum — a minimum number

CONCEPT~4 min

Heartbeat & Failure Detection

Is node 7 dead, or is its network link just slow? You can't tell from a missed response. Decide too quickly → false-positive, you kick a healthy node. Decide too slowly → real failures go unnoticed, requests time out for

CONCEPT~3 min

Service Discovery

Service A calls Service B. Where is B? Hard-coding an IP breaks the moment B autoscales, moves to a new host, or deploys to a new region. Hard-coding a hostname helps, but still routes through DNS caches that are slow to

CONCEPT~4 min

Two Generals & Byzantine Problems

The two foundational impossibility results in distributed systems. Knowing them isn't trivia — they tell you exactly what's possible to build and what's not. Every distributed protocol you'll ever design is constrained b

CONCEPT~4 min

Read Repair & Anti-Entropy

You replicate data across 3 nodes for durability. A network blip causes node 2 to miss a write. Now node 1 + 3 have v2; node 2 has v1. Replicas have diverged. Without active repair, this divergence accumulates — the long

CONCEPT~3 min

Tunable Consistency per Query

"What consistency does our database give us?" is the wrong question. Different queries need different guarantees. A user's password change must be strongly consistent (next login should see the new hash). A like-count di

CONCEPT~3 min

Clock-Skew Tolerance Design

You read Time Synchronization & Clocks and learned wall clocks drift, NTP misbehaves, leap seconds break things. Now what? You can't avoid using time entirely: TTLs, timeouts, scheduling, ordering, JWT expiry all need it. Clock-skew

Reliability

11 / 11 shipped
CONCEPT~3 min

Circuit Breaker

Service B is slow. Service A calls it on every request, each call hanging for 30 seconds before timing out. A's threads fill up waiting for B. A's latency spikes. A's LB marks A unhealthy. Now A is down because B is slow

CONCEPT~3 min

Retries, Backoff & Jitter

Network calls fail. Retry solves 90% of transient failures for free. But naive retries cause retry storms — every client retrying the failing service simultaneously, which is exactly what the failing service can least ha

CONCEPT~3 min

Bulkhead Isolation

Your service has 200 threads. It calls services A, B, C. Service C hangs. Requests to C accumulate; all 200 threads end up waiting on C. Requests to A and B can't get served because no thread is free — even though A and

CONCEPT~3 min

Rate Limiting Algorithms

"Limit users to 100 requests/minute" sounds simple. It is not. Which minute? The last 60 seconds? The current calendar minute? Count as they come, or enforce an even drip? Each answer is a different algorithm with differ

CONCEPT~4 min

Idempotency

A user clicks "Pay." The request hits your server, charges their card, but the response packet is lost in the network. The user's app doesn't see success, retries. Now you've charged them twice. Classic distributed-syste

CONCEPT~3 min

Graceful Degradation

Your recommendation service is down. Does your homepage return a 500, or does it just skip the "Recommended for you" section and still render? The first is a hard failure; the second is graceful degradation. Same failure

CONCEPT~3 min

Backpressure & Flow Control

A producer emits 100k events/sec. A consumer can process 10k/sec. With no coordination, the consumer's queue grows unboundedly — memory bloats, latency cliffs, eventually the process OOMs. Backpressure is the mechanism b

CONCEPT~3 min

Feature Flags & Rollouts

"Deploy a new feature to all 100M users at once" is asking for a 3am incident. Feature flags let you separate code deploy from feature release: ship the code dark, then turn it on for 1% of users, watch metrics, ramp to

CONCEPT~3 min

Chaos Engineering

You think your system handles a database failover. The runbook says it does. The diagram says it does. But you've never actually tested it. The first time it happens for real — at 3am, with no warning — you discover the

CONCEPT~3 min

Request Hedging

Your service makes 5 backend calls per user request. Each call has a P99 of 50ms — sounds great. But the combined P99 is much worse: even one slow call ruins the request. With 5 calls, the chance of at least one being a
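How quickly per-call tail latency compounds can be estimated with one line of probability. This is a sketch under the assumption that the backend calls are independent, which real systems only approximate; `p_any_slow` is an illustrative name.

```python
def p_any_slow(n_calls: int, percentile: float = 0.99) -> float:
    """Chance that at least one of n independent calls exceeds the given percentile."""
    return 1 - percentile ** n_calls


p5 = p_any_slow(5)      # about 0.049, roughly 1 request in 20
p100 = p_any_slow(100)  # about 0.634
```

With a fan-out of 5, roughly 1 in 20 user requests sees at least one call slower than that call's P99. With a fan-out of 100, most requests do.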

CONCEPT~4 min

Blue-Green & Canary Deployments

You ship 100 deploys a week. Each one risks breaking production. The naive "stop the old version, start the new version" gives you 30 seconds of total outage and a rollback that takes longer than the original deploy. Dep

Observability & Security

8 / 8 shipped
CONCEPT~3 min

Observability Triad

3 AM. Your on-call phone rings. Something is broken. You have 15 minutes before users notice, and executives notice soon after. With good observability, you grep a log, check a metric, pull a trace, and see the problem in

CONCEPT~4 min

Auth — OAuth & JWT

Every API needs to answer: who is this user, and what are they allowed to do? Roll your own auth and you'll have a CVE within months. OAuth 2.0 is how you delegate authentication to a trusted identity provider. JWTs are

CONCEPT~3 min

DDoS Protection

A botnet of 100,000 compromised devices each sends 10 requests/sec at your site. 1M RPS at your origin. Your LB handles 50k. Site is down. No users can reach you until the attack stops or you have defenses.

CONCEPT~3 min

Secret Management

Your app needs a Postgres password, an AWS access key, a Stripe API key, a Twilio token, an OAuth client secret. Where do they live?

CONCEPT~3 min

Zero Trust Networking

The traditional security model: castle-and-moat. Strong perimeter (firewall + VPN); inside the perimeter, services trust each other implicitly. The flaw: one compromised laptop or one breached service inside the perimete

CONCEPT~4 min

Tokenization & PCI Compliance

Your e-commerce app needs to store credit card numbers so customers don't re-enter them every time. The moment you do, your entire stack — every server that touches that data, every database that stores it, every backup

CONCEPT~4 min

Field-Level Encryption

"Our database is encrypted." Sounds great. Look closer: encrypted at rest with a single key controlled by the cloud provider. A DBA, a compromised app, an SRE running SELECT * sees plaintext. Encryption-at-rest only prot

CONCEPT~4 min

GDPR — Right to Be Forgotten

EU GDPR Article 17: a user can request all of their personal data be deleted. You have 30 days. "Delete a row" sounds simple. The problem: that user's data is in the live DB, in 7 read replicas, in 30 days of database ba

Data Structures

7 / 7 shipped
CONCEPT~4 min

Bloom Filter

You have 1 billion URLs you've already crawled. Before crawling a new URL, you want to check "have I seen this before?" A hash set costs ~100 GB of RAM for 1B URLs. Most queries will be for URLs you've never seen. You ju

CONCEPT~4 min

Geospatial Indexes

"Find all drivers within 2km of (37.77°N, 122.42°W)." A standard index on (lat, lng) doesn't help — you'd scan every row checking distance. Geospatial indexes let you answer proximity queries in milliseconds over million

CONCEPT~3 min

Merkle Trees

Two Cassandra replicas hold 100M rows each. They're supposed to be identical. Something went wrong and one has a few stale entries. Comparing all 100M rows to find the differences would take hours and saturate the networ

CONCEPT~4 min

HyperLogLog & Sketches

"How many unique visitors did we have today?" Naive: keep a Set<user_id>. For 100M users, that's 6+ GB of RAM. Do it per-URL or per-minute and you're out of memory fast. HyperLogLog answers the same question with 12 KB,

CONCEPT~3 min

Memory-Mapped Files (mmap)

Reading a 100 GB file the normal way: read() in chunks, copy from kernel buffer to user buffer, process. Two copies per byte. Slow at scale. mmap maps the file's bytes directly into your process's virtual memory — access

CONCEPT~4 min

Erasure Coding

Storing 1 PB of files with 3× replication = 3 PB of disk. Storage at scale is dominated by disk cost, and 3× is expensive — Facebook's photos, Netflix's video, Google's mail all add up to exabytes. Erasure coding achieve

CONCEPT~4 min

URL Encoding & Base62

"Make me a short URL." Sounds trivial. The shortcode must be: unique, URL-safe, short (6-8 chars), and ideally non-guessable. Plus, you need to mint billions of them at thousands per second without coordination overhead.

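One common way to get short, URL-safe codes is to Base62-encode a numeric ID. The sketch below assumes the ID comes from some unique source, such as a counter or ID generator, which is not shown. The alphabet order is an arbitrary choice, and the function names are illustrative.

```python
import string

# 0-9, a-z, A-Z: 62 characters that need no percent-encoding in a URL path.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))  # most significant digit first


def base62_decode(s: str) -> int:
    """Inverse of base62_encode."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n


capacity_6 = 62 ** 6  # 56,800,235,584 distinct 6-character codes
capacity_7 = 62 ** 7  # 3,521,614,606,208 distinct 7-character codes
```

Encoding a sequential counter this way gives unique, short codes but not non-guessable ones. Meeting that last requirement needs something extra, such as randomizing or permuting the IDs before encoding.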
Machine Learning Systems

7 / 7 shipped
CONCEPT~3 min

Feature Store

"User's last 7-day click count" is a feature. Your training pipeline computes it from click logs in Spark; your serving pipeline computes it from a Kafka stream. The two implementations drift — one bug here, one rounding

CONCEPT~3 min

Model Serving — Online vs Batch

You trained a recommendation model. Now what — does it predict in real time when the user opens the app, or do you precompute every user's recs nightly and read from a cache? Same model, completely different infrastructu

CONCEPT~3 min

Vector Databases

The query is "movies like Inception." Postgres can't help — there's no SQL operator for "semantically similar." With embeddings (vectors of 384-1536 floats representing meaning), the answer is: nearest neighbors of Incep

CONCEPT~3 min

Embedding Generation Pipelines

You have 100 million product descriptions, 50 million user-generated images, 1 billion documents. To use them with vector search, every one needs an embedding — a 384–1536-dim vector. Calling an embedding model API for e

CONCEPT~3 min

Online vs Offline Training

Your recommendation model was trained on last month's data. New trends emerged. New users appeared. The model is already stale — predictions degrade by the hour. Do you re-train nightly (offline), continuously update wei

CONCEPT~3 min

A/B Testing Platform

You ship a new homepage. Did it improve conversion or hurt it? Eyeballing the dashboard for a week proves nothing — traffic patterns shift hour to hour, day to day. A/B testing is the discipline of statistically comparin

CONCEPT~4 min

LLM Serving Infrastructure

Serving GPT-class LLMs is unlike serving any model that came before. A single inference can take 30 seconds, generate thousands of tokens, require tens of GB of GPU memory, and cost $0.001-0.10 per request. Multiply b

Architecture Patterns

5 / 5 shipped

Operations

3 / 3 shipped

Frontend & Mobile

4 / 4 shipped

Interview Tactics

1 / 1 shipped