Rate Limiter

02

Requirements

Functional

Limit requests per user ID, IP address, or API key
Support multiple rules — different limits per endpoint
Return 429 Too Many Requests with headers when limit exceeded
Allow configurable limits — rules updatable without redeployment
Support multiple time windows — per second, per minute, per day

Non-Functional

Rate limiting adds <5ms p99 latency per request
Handles 100K+ RPS without degrading
Highly available — a rate limiter failure must not take down the API
Approximate accuracy acceptable — fail open on Redis outage
Rules stored in a config layer — not hardcoded

Clarify Early in Interview

Before drawing anything, ask: What dimension are we limiting on? — user ID, IP, API key, or global per-endpoint? Each changes where the limiter lives and what data it reads. Also ask whether the system is client-facing (fail open) or internal/financial (fail closed). This one question separates strong candidates from weak ones.

03

Scale Estimation

Metric	Calculation	Result
Peak RPS	Assume 10M active users, 10 req/user/min avg	~100K RPS
Redis reads/writes	2 ops per request (INCR + EXPIRE check)	200K ops/s
Redis throughput	Single Redis node: ~200K ops/s. We need cluster.	3–5 nodes
Memory per user	Sliding window counter: 2 integers per user = ~100 bytes	~1 GB / 10M users
Latency budget	Redis in-region RTT ~0.5ms + processing = well under 5ms	<2ms typical

Key Insight

The 2 ops per request is the critical number. At 100K RPS that's 200K Redis operations per second — right at the edge of a single Redis node's capacity. This is why we need a Redis cluster. The memory footprint is tiny (~1 GB for 10M users) because the sliding window counter stores just two integers per user, not a full timestamp log.

04

API Design

The rate limiter is not a standalone REST service with its own endpoints — it's a middleware layer. But the headers it adds to every response are its public interface.

GET /v1/any-endpoint Normal allowed response

// Response 200 — includes rate limit state in headers HTTP/1.1 200 OK X-RateLimit-Limit: 1000 // max requests per window X-RateLimit-Remaining: 743 // requests left this window X-RateLimit-Reset: 1704067200 // Unix timestamp when window resets

GET /v1/any-endpoint Rate limited response

// Response 429 — client must back off HTTP/1.1 429 Too Many Requests X-RateLimit-Limit: 1000 X-RateLimit-Remaining: 0 X-RateLimit-Reset: 1704067200 Retry-After: 47 // seconds until client may retry { "error": "rate_limit_exceeded", "retry_after_seconds": 47 }

GET /internal/rate-limit/config Admin — view current rules

// Response 200 — rule config (internal only) { "rules": [ { "endpoint": "/v1/search", "limit": 100, "window_seconds": 60 }, { "endpoint": "/v1/messages", "limit": 1000, "window_seconds": 60 }, { "endpoint": "*", "limit": 5000, "window_seconds": 60 } ] }

Why Retry-After Matters

Without Retry-After, rejected clients retry immediately — creating a thundering herd that makes the rate limit problem worse. A client that reads this header backs off for exactly 47 seconds and tries once. This transforms a flood of retries into a trickle. This is the answer most candidates miss when asked "how do you handle the 429 response?"

05

High-Level Architecture

Architecture — Where Rate Limiting Lives SVG Diagram

CDN / WAF (Edge)

Blunt IP-level protection. Blocks obvious abuse (DDoS, scrapers) before traffic enters your infrastructure. Coarse-grained — doesn't know user IDs or API keys. Cloudflare and AWS WAF handle this layer.

API Gateway — The Main Layer

Sees the full request context: user ID, API key, endpoint, method. This is where nuanced per-user per-endpoint limits are enforced. One place, all services benefit. No logic duplication across microservices.

Redis Cluster

Stores counters. Atomic INCR prevents race conditions. Cluster mode shards keys across nodes to handle 200K+ ops/sec. TTL auto-expires old windows — no background cleanup jobs needed.

Why Not Inside Each Service?

Duplicating rate limiting logic across 20 microservices means 20 places to fix bugs, 20 separate Redis connections, and no cross-service view. A user could bypass per-service limits by calling multiple services simultaneously. Don't do this.

06

Deep Dive — The Five Algorithms

The Core Question

Every rate limiting algorithm answers one question: "How many requests has this user made in the last N seconds?" The algorithms differ only in how they store that information — and the tradeoff is always between memory cost and accuracy.

Algorithm 1 — Fixed Window Counter O(1) memory · Has boundary flaw

// Redis key includes the minute bucket key = "rate:user123:2024:01:01:12:34" // expires at minute boundary INCR key EXPIRE key 60 // The boundary burst problem: // 11:59:50 → 100 requests allowed (end of window) // 12:00:10 → 100 requests allowed (start of next window) // Real span = 20 seconds, 200 requests → 2× your limit

Algorithm 2 — Sliding Window Log O(requests) memory · Perfectly accurate · Too expensive

// Store every request timestamp in a sorted set key = "rate:user123" ZREMRANGEBYSCORE key 0 (now - 60s) // remove expired count = ZCARD key // count remaining if count < limit: ZADD key now now // log this request // Cost: 1000 req/min = 1000 timestamps stored per user // At 1M users = 1B timestamps in Redis. Too expensive.

Algorithm 3 — Token Bucket O(1) memory · Allows intentional bursting

// Redis stores just two values per user { "tokens": 7, "last_refill": 1704067140 } // On each request: elapsed = now - last_refill new_tokens = min(capacity, tokens + elapsed × refill_rate) if new_tokens >= 1: store { tokens: new_tokens - 1, last_refill: now } return ALLOW else: return REJECT // Burst allowed: idle user accumulates tokens, spends all at once // Used by: AWS API Gateway, Stripe, most user-facing APIs

Algorithm 4 — Leaky Bucket O(1) memory · Fixed output rate · No bursting

// Requests queue up; they leave at a fixed drip rate // Output is always uniform regardless of input burst queue.enqueue(request) // add to bucket if queue.size > capacity: return REJECT // bucket overflows // Process at fixed rate: 1 request per 10ms regardless // Use case: protecting a downstream system that can't handle bursts // e.g. a payment processor, a slow 3rd-party API

Algorithm 5 — Sliding Window Counter O(1) memory · ~Perfect accuracy · Production standard

// Store only two counters per user: previous window + current window prev_count = 80 // requests in the previous full minute curr_count = 40 // requests so far in the current minute elapsed_pct = 0.75 // 45 seconds into current 60s window // Interpolate: how much of the previous window is still in our lookback? prev_weight = 1.0 - elapsed_pct // = 0.25 (25% of prev window remains) estimate = (prev_count × prev_weight) + curr_count = (80 × 0.25) + 40 = 20 + 40 = 60 // well under limit of 100 // Boundary burst example — what fixed window gets wrong: // User sent 100 req at 11:59:30. Now it's 12:00:30 (30s into new window) estimate = (100 × 0.5) + 0 = 50 // correctly blocks burst — only 50 headroom // Memory: just 2 integers per user regardless of traffic volume // Error vs perfect sliding window: ~0.003% in practice

Algorithm	Memory	Burst	Accuracy	Best For
Fixed Window	O(1)	Boundary burst	Flawed	Simple internal tools
Sliding Window Log	O(n)	Controlled	Perfect	Low-traffic, high-accuracy
Token Bucket	O(1)	Intentional	Good	User-facing APIs
Leaky Bucket	O(1)	None	Good	Protecting downstream systems
Sliding Window Counter	O(1)	Controlled	~Perfect	Most production systems ✓

Sequence — Request Allowed vs Rejected Mermaid.js

sequenceDiagram participant C as Client participant GW as API Gateway participant RL as Rate Limiter participant R as Redis participant S as Service C->>GW: GET /v1/search GW->>RL: check_limit(user_id, endpoint) RL->>R: INCR rate:user123:search:current R-->>RL: counter = 43 RL->>R: GET rate:user123:search:previous R-->>RL: prev = 80 Note over RL: estimate = 80×0.25 + 43 = 63 < 100 ✓ RL-->>GW: ALLOW (remaining: 37) GW->>S: Forward request S-->>C: 200 OK + X-RateLimit headers Note over C,S: --- User exceeds limit --- C->>GW: GET /v1/search (101st request) GW->>RL: check_limit(user_id, endpoint) RL->>R: INCR rate:user123:search:current R-->>RL: counter = 101 Note over RL: estimate = 80×0.25 + 101 = 121 > 100 ✗ RL-->>GW: REJECT GW-->>C: 429 Too Many Requests + Retry-After: 47

07

Key Design Decisions & Tradeoffs

Decision 1 — Where to place the rate limiter

Option A — Chosen

API Gateway (Centralised)

Single enforcement point. All services benefit automatically. No logic duplication. Gateway already has auth context — knows user ID, API key, which endpoint. Easiest to maintain and update rules.

✓ Standard production choice

Option B

Inside Each Microservice

Maximum flexibility — each service sets its own rules. But logic is duplicated across N services. A cross-service attacker can hit service A 100 times and service B 100 times, bypassing any per-service limit.

✗ Logic duplication, bypass risk

Decision 2 — Redis failure behaviour (CAP tradeoff)

Option A — Chosen for most APIs

Fail Open

If Redis is unreachable, allow all requests through. API stays up and serving users. Rate limiting is temporarily unenforced during the outage. Acceptable for social, search, content APIs where slight over-allowance is harmless.

✓ Availability over consistency

Option B — For financial/sensitive APIs

Fail Closed

If Redis is unreachable, reject all requests with 429. Rate limiting always enforced. But your API is entirely down even though your application servers are healthy. Appropriate for payment APIs, auth endpoints.

~ Consistency over availability

Decision 3 — Multi-region enforcement

Option A — Chosen

Local Budgets (Approximate)

Divide global limit by number of regions. Each region enforces its share independently. No cross-region Redis calls — zero latency overhead. A user could theoretically use ~N× limit if perfectly distributed across N regions, but this is rare and acceptable.

✓ Low latency, good enough accuracy

Option B

Global Redis Sync

All regions point to one global Redis (or use cross-region replication). Perfect global accuracy. But every single request now pays 50–150ms of cross-region latency just to check a counter. Completely defeats the purpose of a multi-region deployment.

✗ Latency kills the design

Decision 4 — Algorithm choice

Option A — Chosen

Sliding Window Counter

O(1) memory. Near-perfect accuracy (~0.003% error). No boundary burst problem. Just two integers per user. Works perfectly with Redis atomic operations. The best cost/accuracy ratio of all five algorithms.

✓ Best of all worlds

Option B — When accuracy is paramount

Sliding Window Log

Mathematically perfect — no approximation at all. But stores every request timestamp. At high traffic, memory explodes. 1M users × 1000 req/min = 1 billion Redis entries. Only worth it for very low traffic or extremely strict accuracy requirements.

~ Only at very low scale

08

What Can Go Wrong

⚡

Race Condition on Distributed Counter

Two API gateway nodes read the counter simultaneously — both see 99, both allow the request, both write 100. User sneaks through 101 requests at the limit boundary. At high concurrency this happens constantly with naive read-modify-write logic.

→ Fix: Redis atomic INCR — reads and writes in a single uninterruptible operation. No race possible.

🔴

Redis Node Goes Down

Redis is your single point of failure for rate limiting. If the Redis cluster loses a primary node, rate limiting is unavailable. The policy question: do you fail open (allow everything, API stays up) or fail closed (block everything, API goes dark)?

→ Fix: Redis Sentinel / Cluster for HA. Fail open for consumer APIs, fail closed for financial endpoints.

🌊

Thundering Herd from Retry Storms

Rate limiter rejects 10,000 users at once. All their clients immediately retry. Another wave of 10,000 requests arrives — all get rejected again. This loop can repeat indefinitely, creating a denial-of-service from your own clients even after the original load spike has passed.

→ Fix: Always return Retry-After header. Clients that respect it stagger retries. Add exponential backoff guidance in API docs.

🌍

Multi-Region Limit Bypass

Global limit is 1,000 req/min split across 3 regions (333 each). A sophisticated client routes 333 requests to US, 333 to EU, 333 to Asia simultaneously. They've sent 999 requests — just under the global limit per region, but used the full budget. In practice this is very hard to exploit but theoretically possible.

→ Fix: Accept the approximation for most use cases. For strict global limits, use a dedicated global coordination service (rare).

🔑

Limit Dimension Mismatch

Rate limiting by IP address seems sensible until a corporate office has 10,000 employees all behind a single NAT IP. Every employee's request counts against one shared bucket — one person's script brings down everyone's access. Conversely, a single attacker behind a VPN can rotate IPs to avoid per-IP limits.

→ Fix: Prefer user ID or API key over IP. Use IP only at the edge for unauthenticated traffic. Layer multiple dimensions.

⚠

Anti-patterns

🚫

Fixed-window counters in Redis

Burst right at the window boundary passes 2× the limit.

✓ Better: Sliding-window log or sliding-window counter; smooths burst at boundary.

🚫

Check limit in app server memory

Each instance has its own counter; user rotates through instances.

✓ Better: Centralized Redis INCR with TTL; or token bucket per user_id.

🚫

One bucket for all traffic

Can't limit per-API or per-user class.

✓ Better: Hierarchy: global → per-tenant → per-endpoint → per-user; deny if any exceeded.

09

Interview Tips

01

Ask what dimension before drawing anything. "Are we limiting by user ID, IP, API key, or global per-endpoint?" The answer completely changes where the limiter lives and what Redis key structure you use. Interviewers love candidates who clarify constraints before jumping to architecture.

02

Know all five algorithms and their tradeoffs. Most candidates only know token bucket and fixed window. Walking through all five — and explaining why sliding window counter wins — demonstrates serious depth. The comparison table in section 06 is your cheat sheet.

03

The 429 response is not just a status code. The most common follow-up question is "what do you return when you reject?" Most candidates say "429 and move on." The complete answer is: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After. Explain that Retry-After prevents thundering herds from retry storms. This answer separates strong candidates from everyone else.

04

Name the CAP tradeoff explicitly. When asked about Redis failure: "This is a CAP theorem decision. Rate limiting is shared distributed state — on a partition we choose between consistency (fail closed) and availability (fail open). For this API I'd choose availability because approximate correctness is acceptable here." Shows architectural thinking, not just implementation knowledge.

05

Bring up multi-region unprompted. Interviewers almost always ask "what happens across multiple data centers?" Getting there first — and explaining local budgets vs global Redis sync — shows you're thinking at scale. Say: "The obvious solution is a global Redis, but cross-region latency of 50–150ms on every request defeats the point. I'd split the budget by region and accept the approximation."

06

Don't forget the rule configuration layer. Rate limits shouldn't be hardcoded in the gateway. Mention a config service (or even a Redis hash) that stores rules per endpoint and per user tier. This shows you've thought about operability — how ops teams change limits without redeploying the entire gateway.

The Answer Most Candidates Miss — 429 Response Headers

HTTP Response — Rate Limited 429 Too Many Requests

X-RateLimit-Limit 1000 Maximum requests allowed in the window

X-RateLimit-Remaining 0 Requests left in current window

X-RateLimit-Reset 1704067200 Unix timestamp when the window resets

Retry-After 47 Seconds until the client may retry — prevents thundering herds

Why This Matters

Without Retry-After, well-meaning clients retry immediately in a loop — creating a thundering herd that makes your rate limiting problem worse. A client that reads this header waits exactly 47 seconds and tries once. This single header transforms a flood of rejected retries into a controlled trickle. Include these headers on every response, not just 429s — clients use X-RateLimit-Remaining to proactively throttle themselves before hitting the limit.

10

How the Design Evolves

Phase 1 — Single Server

In-Memory Counter, No Redis

A simple HashMap in your application server's memory. Works perfectly for a single-server deployment. Ship this first. When you add a second server, the counters diverge and you graduate to Redis. Don't over-engineer before you have the problem.

Phase 2 — Multiple Servers, Single Region

Centralised Redis + Sliding Window Counter

Move counters to a single Redis instance. Add atomic INCR to eliminate race conditions. Implement sliding window counter algorithm for accuracy. This is the correct architecture for 99% of production systems and where most teams stay indefinitely.

Phase 3 — High Traffic, Single Region

Redis Cluster + Rule Config Service

Shard Redis across 3–5 nodes to handle 200K+ ops/sec. Move limit rules from hardcoded config into a dedicated rule store (Redis hash or a lightweight config service). Operators can now adjust limits without redeployment. Add per-endpoint, per-user-tier rules.

Phase 4 — Multi-Region Global

Regional Budgets + Edge Enforcement

Each region gets a fraction of the global limit and enforces it locally. No cross-region Redis calls. Add CDN/WAF rate limiting at the edge for coarse IP-based protection. For the very strictest global limits on sensitive endpoints, a global coordination layer (e.g. a dedicated low-latency counting service) may be worth the added complexity — only build this if you've proven you need it.

📺

References & Videos

System Design: Rate Limiter

ByteByteGo · 8 min

Rate Limiting (Alarm Systems)

Gaurav Sen · 12 min

Distributed Rate Limiting

Hussein Nasser · 20 min

Design a Rate Limiter

AlgoMaster

Scaling your API with rate limiters

Stripe Blog

Requirements

Scale Estimation

API Design

High-Level Architecture

Deep Dive — The Five Algorithms

Key Design Decisions & Tradeoffs

What Can Go Wrong

Anti-patterns

Interview Tips

Similar Problems

How the Design Evolves

References & Videos

Latency Numbers