A rate limiter controls how frequently a client can make requests to an API — protecting backend services from being overwhelmed by traffic spikes, malicious abuse, or runaway retry loops. The challenge is enforcing limits accurately across a distributed fleet of servers without adding meaningful latency to every request.
Limit requests per user ID, IP address, or API key
Support multiple rules — different limits per endpoint
Return 429 Too Many Requests with headers when limit exceeded
Allow configurable limits — rules updatable without redeployment
Support multiple time windows — per second, per minute, per day
Non-Functional
Rate limiting adds <5ms p99 latency per request
Handles 100K+ RPS without degrading
Highly available — a rate limiter failure must not take down the API
Approximate accuracy acceptable — fail open on Redis outage
Rules stored in a config layer — not hardcoded
Clarify Early in Interview
Before drawing anything, ask: What dimension are we limiting on? — user ID, IP, API key, or global per-endpoint? Each changes where the limiter lives and what data it reads. Also ask whether the system is client-facing (fail open) or internal/financial (fail closed). This one question separates strong candidates from weak ones.
03
Scale Estimation
Metric
Calculation
Result
Peak RPS
Assume 10M active users, 10 req/user/min avg
~100K RPS
Redis reads/writes
2 ops per request (INCR + EXPIRE check)
200K ops/s
Redis throughput
Single Redis node: ~200K ops/s. We need cluster.
3–5 nodes
Memory per user
Sliding window counter: 2 integers per user = ~100 bytes
~1 GB / 10M users
Latency budget
Redis in-region RTT ~0.5ms + processing = well under 5ms
<2ms typical
Key Insight
The 2 ops per request is the critical number. At 100K RPS that's 200K Redis operations per second — right at the edge of a single Redis node's capacity. This is why we need a Redis cluster. The memory footprint is tiny (~1 GB for 10M users) because the sliding window counter stores just two integers per user, not a full timestamp log.
04
API Design
The rate limiter is not a standalone REST service with its own endpoints — it's a middleware layer. But the headers it adds to every response are its public interface.
GET/v1/any-endpointNormal allowed response
// Response 200 — includes rate limit state in headers
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000// max requests per windowX-RateLimit-Remaining: 743// requests left this windowX-RateLimit-Reset: 1704067200// Unix timestamp when window resets
GET/v1/any-endpointRate limited response
// Response 429 — client must back off
HTTP/1.1 429 Too Many RequestsX-RateLimit-Limit: 1000X-RateLimit-Remaining: 0X-RateLimit-Reset: 1704067200Retry-After: 47// seconds until client may retry
{ "error": "rate_limit_exceeded", "retry_after_seconds": 47 }
GET/internal/rate-limit/configAdmin — view current rules
Without Retry-After, rejected clients retry immediately — creating a thundering herd that makes the rate limit problem worse. A client that reads this header backs off for exactly 47 seconds and tries once. This transforms a flood of retries into a trickle. This is the answer most candidates miss when asked "how do you handle the 429 response?"
05
High-Level Architecture
Architecture — Where Rate Limiting LivesSVG Diagram
CDN / WAF (Edge)
Blunt IP-level protection. Blocks obvious abuse (DDoS, scrapers) before traffic enters your infrastructure. Coarse-grained — doesn't know user IDs or API keys. Cloudflare and AWS WAF handle this layer.
API Gateway — The Main Layer
Sees the full request context: user ID, API key, endpoint, method. This is where nuanced per-user per-endpoint limits are enforced. One place, all services benefit. No logic duplication across microservices.
Redis Cluster
Stores counters. Atomic INCR prevents race conditions. Cluster mode shards keys across nodes to handle 200K+ ops/sec. TTL auto-expires old windows — no background cleanup jobs needed.
Why Not Inside Each Service?
Duplicating rate limiting logic across 20 microservices means 20 places to fix bugs, 20 separate Redis connections, and no cross-service view. A user could bypass per-service limits by calling multiple services simultaneously. Don't do this.
06
Deep Dive — The Five Algorithms
The Core Question
Every rate limiting algorithm answers one question: "How many requests has this user made in the last N seconds?" The algorithms differ only in how they store that information — and the tradeoff is always between memory cost and accuracy.
// Redis key includes the minute bucket
key = "rate:user123:2024:01:01:12:34"// expires at minute boundary
INCR key
EXPIRE key 60
// The boundary burst problem:// 11:59:50 → 100 requests allowed (end of window)// 12:00:10 → 100 requests allowed (start of next window)// Real span = 20 seconds, 200 requests → 2× your limit
Algorithm 2 — Sliding Window LogO(requests) memory · Perfectly accurate · Too expensive
// Store every request timestamp in a sorted set
key = "rate:user123"
ZREMRANGEBYSCORE key 0 (now - 60s) // remove expired
count = ZCARD key // count remainingif count < limit:
ZADD key now now // log this request// Cost: 1000 req/min = 1000 timestamps stored per user// At 1M users = 1B timestamps in Redis. Too expensive.
// Redis stores just two values per user
{ "tokens": 7, "last_refill": 1704067140 }
// On each request:
elapsed = now - last_refill
new_tokens = min(capacity, tokens + elapsed × refill_rate)
if new_tokens >= 1:
store { tokens: new_tokens - 1, last_refill: now }
return ALLOW
else:
return REJECT
// Burst allowed: idle user accumulates tokens, spends all at once// Used by: AWS API Gateway, Stripe, most user-facing APIs
// Requests queue up; they leave at a fixed drip rate// Output is always uniform regardless of input burst
queue.enqueue(request) // add to bucketif queue.size > capacity:
return REJECT // bucket overflows// Process at fixed rate: 1 request per 10ms regardless// Use case: protecting a downstream system that can't handle bursts// e.g. a payment processor, a slow 3rd-party API
✓ Production Standard
Algorithm 5 — Sliding Window Counter combines fixed window's O(1) memory with near-perfect accuracy. This is what most production systems (Nginx, Kong, Cloudflare) use under the hood.
Algorithm 5 — Sliding Window CounterO(1) memory · ~Perfect accuracy · Production standard
// Store only two counters per user: previous window + current window
prev_count = 80// requests in the previous full minute
curr_count = 40// requests so far in the current minute
elapsed_pct = 0.75// 45 seconds into current 60s window// Interpolate: how much of the previous window is still in our lookback?
prev_weight = 1.0 - elapsed_pct // = 0.25 (25% of prev window remains)
estimate = (prev_count × prev_weight) + curr_count
= (80 × 0.25) + 40
= 20 + 40 = 60// well under limit of 100// Boundary burst example — what fixed window gets wrong:// User sent 100 req at 11:59:30. Now it's 12:00:30 (30s into new window)
estimate = (100 × 0.5) + 0 = 50// correctly blocks burst — only 50 headroom// Memory: just 2 integers per user regardless of traffic volume// Error vs perfect sliding window: ~0.003% in practice
Algorithm
Memory
Burst
Accuracy
Best For
Fixed Window
O(1)
Boundary burst
Flawed
Simple internal tools
Sliding Window Log
O(n)
Controlled
Perfect
Low-traffic, high-accuracy
Token Bucket
O(1)
Intentional
Good
User-facing APIs
Leaky Bucket
O(1)
None
Good
Protecting downstream systems
Sliding Window Counter
O(1)
Controlled
~Perfect
Most production systems ✓
Sequence — Request Allowed vs RejectedMermaid.js
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant RL as Rate Limiter
participant R as Redis
participant S as Service
C->>GW: GET /v1/search
GW->>RL: check_limit(user_id, endpoint)
RL->>R: INCR rate:user123:search:current
R-->>RL: counter = 43
RL->>R: GET rate:user123:search:previous
R-->>RL: prev = 80
Note over RL: estimate = 80×0.25 + 43 = 63 < 100 ✓
RL-->>GW: ALLOW (remaining: 37)
GW->>S: Forward request
S-->>C: 200 OK + X-RateLimit headers
Note over C,S: --- User exceeds limit ---
C->>GW: GET /v1/search (101st request)
GW->>RL: check_limit(user_id, endpoint)
RL->>R: INCR rate:user123:search:current
R-->>RL: counter = 101
Note over RL: estimate = 80×0.25 + 101 = 121 > 100 ✗
RL-->>GW: REJECT
GW-->>C: 429 Too Many Requests + Retry-After: 47
07
Key Design Decisions & Tradeoffs
Decision 1 — Where to place the rate limiter
Option A — Chosen
API Gateway (Centralised)
Single enforcement point. All services benefit automatically. No logic duplication. Gateway already has auth context — knows user ID, API key, which endpoint. Easiest to maintain and update rules.
✓ Standard production choice
Option B
Inside Each Microservice
Maximum flexibility — each service sets its own rules. But logic is duplicated across N services. A cross-service attacker can hit service A 100 times and service B 100 times, bypassing any per-service limit.
If Redis is unreachable, allow all requests through. API stays up and serving users. Rate limiting is temporarily unenforced during the outage. Acceptable for social, search, content APIs where slight over-allowance is harmless.
✓ Availability over consistency
Option B — For financial/sensitive APIs
Fail Closed
If Redis is unreachable, reject all requests with 429. Rate limiting always enforced. But your API is entirely down even though your application servers are healthy. Appropriate for payment APIs, auth endpoints.
~ Consistency over availability
Decision 3 — Multi-region enforcement
Option A — Chosen
Local Budgets (Approximate)
Divide global limit by number of regions. Each region enforces its share independently. No cross-region Redis calls — zero latency overhead. A user could theoretically use ~N× limit if perfectly distributed across N regions, but this is rare and acceptable.
✓ Low latency, good enough accuracy
Option B
Global Redis Sync
All regions point to one global Redis (or use cross-region replication). Perfect global accuracy. But every single request now pays 50–150ms of cross-region latency just to check a counter. Completely defeats the purpose of a multi-region deployment.
✗ Latency kills the design
Decision 4 — Algorithm choice
Option A — Chosen
Sliding Window Counter
O(1) memory. Near-perfect accuracy (~0.003% error). No boundary burst problem. Just two integers per user. Works perfectly with Redis atomic operations. The best cost/accuracy ratio of all five algorithms.
✓ Best of all worlds
Option B — When accuracy is paramount
Sliding Window Log
Mathematically perfect — no approximation at all. But stores every request timestamp. At high traffic, memory explodes. 1M users × 1000 req/min = 1 billion Redis entries. Only worth it for very low traffic or extremely strict accuracy requirements.
~ Only at very low scale
08
What Can Go Wrong
⚡
Race Condition on Distributed Counter
Two API gateway nodes read the counter simultaneously — both see 99, both allow the request, both write 100. User sneaks through 101 requests at the limit boundary. At high concurrency this happens constantly with naive read-modify-write logic.
→ Fix: Redis atomic INCR — reads and writes in a single uninterruptible operation. No race possible.
🔴
Redis Node Goes Down
Redis is your single point of failure for rate limiting. If the Redis cluster loses a primary node, rate limiting is unavailable. The policy question: do you fail open (allow everything, API stays up) or fail closed (block everything, API goes dark)?
→ Fix: Redis Sentinel / Cluster for HA. Fail open for consumer APIs, fail closed for financial endpoints.
🌊
Thundering Herd from Retry Storms
Rate limiter rejects 10,000 users at once. All their clients immediately retry. Another wave of 10,000 requests arrives — all get rejected again. This loop can repeat indefinitely, creating a denial-of-service from your own clients even after the original load spike has passed.
→ Fix: Always return Retry-After header. Clients that respect it stagger retries. Add exponential backoff guidance in API docs.
🌍
Multi-Region Limit Bypass
Global limit is 1,000 req/min split across 3 regions (333 each). A sophisticated client routes 333 requests to US, 333 to EU, 333 to Asia simultaneously. They've sent 999 requests — just under the global limit per region, but used the full budget. In practice this is very hard to exploit but theoretically possible.
→ Fix: Accept the approximation for most use cases. For strict global limits, use a dedicated global coordination service (rare).
🔑
Limit Dimension Mismatch
Rate limiting by IP address seems sensible until a corporate office has 10,000 employees all behind a single NAT IP. Every employee's request counts against one shared bucket — one person's script brings down everyone's access. Conversely, a single attacker behind a VPN can rotate IPs to avoid per-IP limits.
→ Fix: Prefer user ID or API key over IP. Use IP only at the edge for unauthenticated traffic. Layer multiple dimensions.
⚠
Anti-patterns
🚫
Fixed-window counters in Redis
Burst right at the window boundary passes 2× the limit.
✓ Better: Sliding-window log or sliding-window counter; smooths burst at boundary.
🚫
Check limit in app server memory
Each instance has its own counter; user rotates through instances.
✓ Better: Centralized Redis INCR with TTL; or token bucket per user_id.
🚫
One bucket for all traffic
Can't limit per-API or per-user class.
✓ Better: Hierarchy: global → per-tenant → per-endpoint → per-user; deny if any exceeded.
09
Interview Tips
01
Ask what dimension before drawing anything. "Are we limiting by user ID, IP, API key, or global per-endpoint?" The answer completely changes where the limiter lives and what Redis key structure you use. Interviewers love candidates who clarify constraints before jumping to architecture.
02
Know all five algorithms and their tradeoffs. Most candidates only know token bucket and fixed window. Walking through all five — and explaining why sliding window counter wins — demonstrates serious depth. The comparison table in section 06 is your cheat sheet.
03
The 429 response is not just a status code. The most common follow-up question is "what do you return when you reject?" Most candidates say "429 and move on." The complete answer is: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After. Explain that Retry-After prevents thundering herds from retry storms. This answer separates strong candidates from everyone else.
04
Name the CAP tradeoff explicitly. When asked about Redis failure: "This is a CAP theorem decision. Rate limiting is shared distributed state — on a partition we choose between consistency (fail closed) and availability (fail open). For this API I'd choose availability because approximate correctness is acceptable here." Shows architectural thinking, not just implementation knowledge.
05
Bring up multi-region unprompted. Interviewers almost always ask "what happens across multiple data centers?" Getting there first — and explaining local budgets vs global Redis sync — shows you're thinking at scale. Say: "The obvious solution is a global Redis, but cross-region latency of 50–150ms on every request defeats the point. I'd split the budget by region and accept the approximation."
06
Don't forget the rule configuration layer. Rate limits shouldn't be hardcoded in the gateway. Mention a config service (or even a Redis hash) that stores rules per endpoint and per user tier. This shows you've thought about operability — how ops teams change limits without redeploying the entire gateway.
The Answer Most Candidates Miss — 429 Response Headers
HTTP Response — Rate Limited429 Too Many Requests
X-RateLimit-Limit1000Maximum requests allowed in the window
X-RateLimit-Remaining0Requests left in current window
X-RateLimit-Reset1704067200Unix timestamp when the window resets
Retry-After47Seconds until the client may retry — prevents thundering herds
Why This Matters
Without Retry-After, well-meaning clients retry immediately in a loop — creating a thundering herd that makes your rate limiting problem worse. A client that reads this header waits exactly 47 seconds and tries once. This single header transforms a flood of rejected retries into a controlled trickle. Include these headers on every response, not just 429s — clients use X-RateLimit-Remaining to proactively throttle themselves before hitting the limit.
10
Similar Problems
The rate limiter's core pattern — a distributed counter in Redis with atomic operations — appears in several other system design problems.
A simple HashMap in your application server's memory. Works perfectly for a single-server deployment. Ship this first. When you add a second server, the counters diverge and you graduate to Redis. Don't over-engineer before you have the problem.
Phase 2 — Multiple Servers, Single Region
Centralised Redis + Sliding Window Counter
Move counters to a single Redis instance. Add atomic INCR to eliminate race conditions. Implement sliding window counter algorithm for accuracy. This is the correct architecture for 99% of production systems and where most teams stay indefinitely.
Phase 3 — High Traffic, Single Region
Redis Cluster + Rule Config Service
Shard Redis across 3–5 nodes to handle 200K+ ops/sec. Move limit rules from hardcoded config into a dedicated rule store (Redis hash or a lightweight config service). Operators can now adjust limits without redeployment. Add per-endpoint, per-user-tier rules.
Phase 4 — Multi-Region Global
Regional Budgets + Edge Enforcement
Each region gets a fraction of the global limit and enforces it locally. No cross-region Redis calls. Add CDN/WAF rate limiting at the edge for coarse IP-based protection. For the very strictest global limits on sensitive endpoints, a global coordination layer (e.g. a dedicated low-latency counting service) may be worth the added complexity — only build this if you've proven you need it.