Design a Rate Limiter: Algorithms and Implementation
Build a distributed rate limiter using token bucket, sliding window, and leaky bucket algorithms. Covers Redis-based implementation and API gateway integration.
This post applies concepts from the System Design from Zero to Hero series.
TL;DR
A distributed rate limiter uses algorithms like token bucket or sliding window with Redis as a shared counter store to enforce request limits per user, IP, or API key across multiple servers. The choice of algorithm determines the trade-off between memory usage, burst tolerance, and accuracy. In production, rate limiters are typically deployed as middleware in an API gateway, returning standard HTTP headers to communicate limits to clients.
Requirements
Functional Requirements
- Limit requests — Enforce a maximum number of requests per time window for a given key (user ID, IP address, or API key).
- Multiple rules — Support different rate limits for different API endpoints (e.g., 100 requests/minute for reads, 10 requests/minute for writes).
- Inform clients — Return standard HTTP headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`) on every response.
- Reject excess requests — Return HTTP 429 (Too Many Requests) with a `Retry-After` header when the limit is exceeded.
- Configurable rules — Rate limit rules should be configurable without code deployments.
Non-Functional Requirements
- Low latency — The rate limiting check must add less than 1 ms to each request.
- Distributed — Must work correctly across multiple API servers sharing the same limits.
- Highly available — If the rate limiter is unavailable, requests should be allowed through (fail-open) rather than blocking all traffic.
- Accurate — Minimize the window where requests can exceed the configured limit (race conditions).
- Memory efficient — Must handle millions of unique keys without excessive memory consumption.
Back-of-Envelope Estimation
Assume an API serving 100,000 requests per second across 50 API servers:
- Rate limit checks per second: 100,000 (one per request)
- Unique keys (users/IPs): 10 million active in any given hour
- Memory per key (token bucket): ~20 bytes (token count + last refill timestamp) → 10M * 20 = ~200 MB
- Memory per key (sliding window log): ~100 bytes per request timestamp → much higher, proportional to the limit
- Redis operations: 100,000 checks per second, each a single atomic command or script → at the edge of a single node's typical capacity (~100K ops/sec), so a small cluster provides headroom
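These estimates can be sanity-checked with a few lines (numbers taken from the text; Redis per-key overhead is ignored, so real memory use runs somewhat higher):

```python
unique_keys = 10_000_000

# Token bucket: ~20 bytes per key (token count + last-refill timestamp).
token_bucket_mb = unique_keys * 20 / 1e6
print(token_bucket_mb)  # 200.0 MB

# Sliding window log: ~100 bytes per stored timestamp. With a limit of
# 1,000 requests/hour, the worst case is 1,000 entries per key:
log_gb = unique_keys * 1000 * 100 / 1e9
print(log_gb)  # 1000.0 GB -- proportional to the limit, as noted
```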
High-Level Design
```
Client → Load Balancer → API Gateway (Rate Limiter Middleware)
                              ↓
                        Rate Limiter
                              ↓
                    Redis (Counter Store)
                              ↓
        [Allowed] → API Server → Response
        [Denied]  → 429 Too Many Requests
```
The rate limiter sits as middleware in the API gateway, before the request reaches the application server. It checks the current request count against the configured limit using Redis as a shared counter store. If the request is within the limit, it passes through. If not, it returns a 429 response immediately without touching the backend.
Detailed Design
Algorithm 1: Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., each minute) and count requests in the current window.
```python
import time
import redis as redis_lib

redis = redis_lib.Redis()  # shared connection to the counter store

def fixed_window_check(key: str, limit: int, window_size: int) -> bool:
    """
    key: identifier (user_id, ip, api_key)
    limit: max requests per window
    window_size: window duration in seconds
    """
    current_window = int(time.time() // window_size)
    redis_key = f"ratelimit:{key}:{current_window}"
    count = redis.incr(redis_key)
    if count == 1:
        # First request in this window: start the expiry clock
        redis.expire(redis_key, window_size)
    return count <= limit
```

Pros: Simple to implement, low memory (one counter per key per window).
Cons: Boundary problem. A user can send limit requests at the end of one window and limit more at the start of the next window, effectively getting 2x the limit within a short burst. For example, with a limit of 100 requests/minute, a user could send 100 requests at 12:00:59 and 100 more at 12:01:01 — 200 requests in 2 seconds.
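The boundary problem is easy to reproduce with a tiny in-memory simulation (no Redis; the helper below is a stripped-down stand-in for `fixed_window_check`):

```python
def fixed_window_allow(counts, now, limit=100, window=60):
    """In-memory stand-in for the Redis fixed-window counter."""
    w = int(now // window)
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= limit

counts = {}
# 100 requests just before the window boundary all pass...
late = sum(fixed_window_allow(counts, 59.9) for _ in range(100))
# ...and 100 more just after it also pass, because the counter resets:
early = sum(fixed_window_allow(counts, 60.1) for _ in range(100))
print(late + early)  # 200 requests accepted in well under a second
```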
Algorithm 2: Sliding Window Log
Track the exact timestamp of every request. To check the limit, count timestamps within the last N seconds.
```python
def sliding_window_log_check(key: str, limit: int, window_size: int) -> bool:
    now = time.time()
    redis_key = f"ratelimit:{key}"
    # Use a Redis sorted set where score = timestamp
    pipeline = redis.pipeline()
    # Remove entries older than the window
    pipeline.zremrangebyscore(redis_key, 0, now - window_size)
    # Count entries in the current window
    pipeline.zcard(redis_key)
    # Add the current request (note: recorded even when denied, which
    # penalizes clients that keep retrying while over the limit)
    pipeline.zadd(redis_key, {str(now): now})
    # Set TTL for cleanup
    pipeline.expire(redis_key, window_size)
    results = pipeline.execute()
    count = results[1]
    return count < limit
```

Pros: Precise — no boundary problem. The window slides smoothly.
Cons: High memory usage. If the limit is 1,000 requests/hour, you store 1,000 timestamps per user. With 10 million users, that is 10 billion entries. This does not scale for high-limit scenarios.
Algorithm 3: Sliding Window Counter
A hybrid that approximates the sliding window using two fixed windows. It weighs the previous window's count by the fraction of time remaining.
```python
def sliding_window_counter_check(key: str, limit: int, window_size: int) -> bool:
    now = time.time()
    current_window = int(now // window_size)
    previous_window = current_window - 1
    # Position within the current window (0.0 to 1.0)
    elapsed_fraction = (now % window_size) / window_size
    current_key = f"ratelimit:{key}:{current_window}"
    previous_key = f"ratelimit:{key}:{previous_window}"
    current_count = int(redis.get(current_key) or 0)
    previous_count = int(redis.get(previous_key) or 0)
    # Weighted count: current window plus the fraction of the previous
    # window that still overlaps the sliding window
    weighted_count = current_count + previous_count * (1 - elapsed_fraction)
    if weighted_count >= limit:
        return False
    redis.incr(current_key)
    redis.expire(current_key, window_size * 2)
    return True
```

Example: With a limit of 100/minute, if the previous minute had 80 requests and we are 40% through the current minute (which has 30 requests so far), the weighted count is 30 + 80 * 0.6 = 78. Since 78 < 100, the request is allowed.
Pros: Low memory (two counters per key), smooths out the boundary problem.
Cons: Approximate — not perfectly accurate, but close enough for practical rate limiting.
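The weighted-count arithmetic from the example above, as a standalone calculation:

```python
def weighted_count(current_count, previous_count, elapsed_fraction):
    # Weight the previous window by the fraction of it still inside
    # the sliding window.
    return current_count + previous_count * (1 - elapsed_fraction)

# 40% into the current minute: 30 current requests, 80 in the previous minute.
print(weighted_count(30, 80, 0.4))  # 78.0 -> under the limit of 100, so allowed
```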
Algorithm 4: Token Bucket (Recommended)
The token bucket is the most widely used algorithm (used by AWS, Stripe, and most major API providers). A bucket holds tokens. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.
```python
def token_bucket_check(key: str, max_tokens: int, refill_rate: float) -> bool:
    """
    max_tokens: bucket capacity (allows bursts up to this amount)
    refill_rate: tokens added per second
    """
    redis_key = f"ratelimit:{key}"
    now = time.time()
    # Lua script for atomic check-and-update
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or max_tokens
    local last_refill = tonumber(data[2]) or now

    -- Calculate tokens to add since last refill
    local elapsed = now - last_refill
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    -- Expire idle buckets after twice the time needed to refill from empty
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
    return allowed
    """
    result = redis.eval(lua_script, 1, redis_key, max_tokens, refill_rate, now)
    return result == 1
```

Pros: Allows controlled bursts (up to `max_tokens`), smooth rate enforcement, low memory (two values per key), easy to reason about.
Cons: Slightly more complex to implement than fixed window. The burst behavior may not be desirable for some use cases.
Why Lua scripts? The check-and-update operation (read tokens, compute refill, deduct, write back) must be atomic. Without atomicity, two concurrent requests could both read the same token count and both succeed when only one should. Redis Lua scripts execute atomically — no other command can interleave during script execution. For more on Redis as a distributed coordination tool, see Part 5: Caching Strategies.
Rate Limiter as Middleware
The rate limiter is deployed as middleware in the API gateway, not within each application server. This centralizes rate limiting logic and ensures it applies consistently regardless of which backend service handles the request.
Request → TLS Termination → Authentication → Rate Limiting → Routing → Backend
The middleware adds standard HTTP headers to every response:
```
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 67
X-RateLimit-Reset: 1691234567
```

Or, when the limit is exceeded:

```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1691234567
Retry-After: 23
```
For API design best practices including rate limit headers, see Part 8: API Design.
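A minimal, framework-agnostic sketch of the middleware step. It assumes requests and responses are plain dicts, and uses a toy in-memory check with an injectable clock in place of the Redis-backed one (`make_fixed_window_check` and the dict shapes are illustrative, not any gateway's API):

```python
import time

def make_fixed_window_check(limit, window, clock=time.time):
    """Toy in-memory limiter; a real deployment would run this check in Redis."""
    counts = {}
    def check(key):
        w = int(clock() // window)
        counts[(key, w)] = counts.get((key, w), 0) + 1
        used = counts[(key, w)]
        reset = (w + 1) * window  # Unix timestamp when the window rolls over
        return used <= limit, max(0, limit - used), int(reset)
    return check

def rate_limit_middleware(handler, check, limit, clock=time.time):
    """Run the limit check before the handler and attach standard headers."""
    def wrapped(request):
        allowed, remaining, reset = check(request["user_id"])
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(remaining),
            "X-RateLimit-Reset": str(reset),
        }
        if not allowed:
            headers["Retry-After"] = str(max(0, reset - int(clock())))
            return {"status": 429, "headers": headers}
        response = handler(request)
        response["headers"] = {**response.get("headers", {}), **headers}
        return response
    return wrapped

clock = lambda: 1_000.0  # fixed clock keeps the demo deterministic
api = rate_limit_middleware(lambda req: {"status": 200},
                            make_fixed_window_check(2, 60, clock),
                            limit=2, clock=clock)
statuses = [api({"user_id": "u1"})["status"] for _ in range(3)]
print(statuses)  # [200, 200, 429]
```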
Per-User vs Per-IP vs Per-API-Key
Different rate limiting keys serve different purposes:
Per-User (authenticated requests):
Rate limit by user_id. This is the most common approach for authenticated APIs. Each user gets their own quota regardless of how many devices or IPs they use.
Per-IP (unauthenticated requests):
Rate limit by client IP address. Useful for login endpoints (brute-force protection) and public APIs. Caveat: users behind a NAT or corporate proxy share an IP. Use the X-Forwarded-For header carefully (it can be spoofed).
Per-API-Key: Rate limit by API key. Common for developer APIs (Stripe, Twilio) where different pricing tiers get different limits. A free tier might get 100 requests/minute while an enterprise tier gets 10,000.
Multi-dimensional rate limiting: Apply multiple rules simultaneously. For example:
- 100 requests/minute per user (protects individual quota)
- 1,000 requests/minute per IP (protects against shared abuse)
- 10 requests/second per user per endpoint (protects expensive endpoints)
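A sketch of AND-ing several rules, using in-memory counters and an explicit timestamp for clarity (in production each check would run against Redis, ideally inside one Lua script):

```python
def check_all_rules(request, rules, counters, now):
    """Allow only if every rule passes; a denied request consumes no quota.
    Each rule is (name, key_fn, limit, window_seconds)."""
    keys = []
    for name, key_fn, limit, window in rules:
        k = (name, key_fn(request), int(now // window))
        if counters.get(k, 0) + 1 > limit:
            return False  # deny before committing any increments
        keys.append(k)
    for k in keys:
        counters[k] = counters.get(k, 0) + 1
    return True

rules = [
    ("per_user", lambda r: r["user_id"], 100, 60),
    ("per_ip", lambda r: r["ip"], 1000, 60),
    ("per_user_endpoint", lambda r: (r["user_id"], r["path"]), 10, 1),
]
counters = {}
req = {"user_id": "u1", "ip": "203.0.113.7", "path": "/api/v1/search"}
results = [check_all_rules(req, rules, counters, now=1000.0) for _ in range(11)]
print(results.count(True))  # 10 -- the strictest rule (10/sec/endpoint) binds
```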
Handling Race Conditions in Distributed Rate Limiting
The core race condition: two requests from the same user arrive at different API servers simultaneously. Both read the counter as 99 (limit is 100), both increment to 100, but the actual count should be 101 (one should be rejected).
Solution 1: Redis Lua Scripts (recommended)
Lua scripts execute atomically on the Redis server. The entire check-and-increment happens as a single operation. No other Redis command can interleave.
Solution 2: Redis MULTI/EXEC Transactions
```
MULTI
INCR ratelimit:user123:current_window
EXPIRE ratelimit:user123:current_window 60
EXEC
```
MULTI/EXEC executes the queued commands as an atomic batch, but it cannot branch on intermediate results — the INCR always succeeds, and you only inspect the count after execution. This works for simple counter-based algorithms but not for token bucket, where the read-compute-write cycle must itself be atomic.
Solution 3: Local Rate Limiting with Sync
For ultra-low latency, each API server maintains a local counter and periodically syncs with Redis. This trades accuracy for speed: during the sync interval, the system may allow slightly more requests than the limit. This approach works when approximate rate limiting is acceptable (e.g., allowing 105 requests when the limit is 100).
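A sketch of the local-counter approach under stated assumptions: `fetch_global` stands in for reading the cluster-wide count from Redis, and pushing local increments back to the store is omitted:

```python
class LocalRateLimiter:
    """Serve checks from a local counter, refreshing the shared count
    from the central store every `sync_interval` seconds."""
    def __init__(self, limit, sync_interval, fetch_global):
        self.limit = limit
        self.sync_interval = sync_interval
        self.fetch_global = fetch_global  # e.g. a Redis read behind the scenes
        self.global_base = 0
        self.local_count = 0
        self.last_sync = float("-inf")

    def allow(self, now):
        if now - self.last_sync >= self.sync_interval:
            # Refresh from the shared store; a full implementation would
            # also push the local increments back here.
            self.global_base = self.fetch_global()
            self.local_count = 0
            self.last_sync = now
        if self.global_base + self.local_count >= self.limit:
            return False
        self.local_count += 1
        return True

# Other servers have already used 3 of the 5-per-window budget:
rl = LocalRateLimiter(limit=5, sync_interval=1.0, fetch_global=lambda: 3)
outcomes = [rl.allow(10.0) for _ in range(3)]
print(outcomes)  # [True, True, False]
```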
Data Model
Rate Limit Rules (Configuration Store)
| Column | Type | Description |
|---|---|---|
| rule_id | VARCHAR | Primary key |
| key_type | VARCHAR | user / ip / api_key |
| endpoint | VARCHAR | API endpoint pattern (e.g., /api/v1/*) |
| algorithm | VARCHAR | token_bucket / sliding_window / fixed |
| max_requests | INT | Maximum requests per window |
| window_seconds | INT | Time window duration |
| burst_size | INT | Max burst (token bucket capacity) |
| tier | VARCHAR | free / pro / enterprise |
Redis Data Structures Per Algorithm
Fixed Window:
```
Key:  ratelimit:{key}:{window_id}
Type: String (counter)
TTL:  window_size seconds
```
Sliding Window Log:
```
Key:  ratelimit:{key}
Type: Sorted Set (score = timestamp, member = request_id)
TTL:  window_size seconds
```
Token Bucket:
```
Key:  ratelimit:{key}
Type: Hash {tokens: float, last_refill: float}
TTL:  2 * (max_tokens / refill_rate) seconds
```
Scaling Considerations
Redis clustering: For millions of unique rate limit keys, use a Redis cluster. Shard by the rate limit key so all requests for a given user hit the same Redis node. This is critical for atomicity — if requests for the same key go to different nodes, the counter is split and the limit is not enforced correctly.
Redis failure handling (fail-open vs fail-closed):
- Fail-open: If Redis is unreachable, allow all requests through. This prevents a Redis outage from becoming a total API outage. The risk is that rate limits are not enforced during the outage.
- Fail-closed: If Redis is unreachable, reject all requests. This is safer for security-sensitive endpoints (login brute force protection) but can cause cascading failures.
- Recommendation: Fail-open for general API rate limiting, fail-closed for security-critical endpoints, with a local in-memory fallback rate limiter that activates during Redis outages.
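The recommendation fits in a few lines. This sketch assumes `redis_check` raises the built-in `ConnectionError` when the store is unreachable and `local_check` is an in-process fallback limiter (both names are illustrative):

```python
def resilient_check(key, redis_check, local_check, fail_open=True):
    """Fail open (with a local fallback) when the central store is down."""
    try:
        return redis_check(key)
    except ConnectionError:
        if not fail_open:
            return False  # fail-closed: safer for security-sensitive endpoints
        return local_check(key)  # approximate limiting during the outage

def redis_down(key):
    raise ConnectionError("Redis unreachable")

print(resilient_check("user:42", redis_down, lambda k: True))         # True
print(resilient_check("user:42", redis_down, lambda k: True, False))  # False
```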
Multiple data centers: If your API servers span multiple data centers, each with its own Redis, the rate limit is effectively split. A user with a limit of 100/minute could send 100 to each data center. Solutions: use a single global Redis cluster with cross-region replication (adds latency), or accept the approximation and set each region's limit to total_limit / num_regions.
Hot keys: A single user making many requests creates a hot key on one Redis shard. Redis handles this well up to ~100K ops/sec per key. Beyond that, use local caching with periodic sync.
Trade-offs and Alternatives
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Algorithm | Token bucket | Sliding window counter | Token bucket for burst control, sliding window for strict limits |
| Storage | Redis | Local memory | Redis for distributed, local for single-server |
| Failure mode | Fail-open | Fail-closed | Fail-open for availability, fail-closed for security |
| Atomicity | Lua scripts | MULTI/EXEC | Lua scripts for complex algorithms |
| Placement | API gateway | Application code | API gateway for centralized enforcement |
Why not rate limit at the application layer? Application-layer rate limiting works for single-server deployments, but in a distributed system, each server only sees a fraction of the traffic. A user sending 100 requests spread across 10 servers would see a limit of 10 on each server but use 100 total. Centralized rate limiting via Redis ensures a single view of usage.
Why not use a leaky bucket? The leaky bucket algorithm processes requests at a fixed rate, queuing excess requests. While it provides smooth output, it adds latency to queued requests (they must wait). Token bucket is preferred because it allows bursts while still enforcing average rates — requests are either allowed instantly or rejected, with no queuing delay.
Commercial alternatives: Cloud providers offer built-in rate limiting (AWS API Gateway, Cloudflare Rate Limiting, Kong). These are operationally simpler but less customizable. For most applications, a managed solution is sufficient. Build your own only when you need custom multi-dimensional rules, per-user tiering, or integration with your specific billing system.
FAQ
Which rate limiting algorithm is best for API protection?
Token bucket is the most popular because it allows short bursts while enforcing average rates. It is memory-efficient, easy to implement with Redis, and used by AWS, Stripe, and most major API providers. The two parameters (bucket capacity and refill rate) map intuitively to business requirements: capacity controls burst size, and refill rate controls sustained throughput. For example, a capacity of 10 and a refill rate of 2/second allows a burst of 10 requests followed by a sustained rate of 2 requests/second.
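The capacity-10, refill-2/second example can be checked with a minimal in-memory token bucket (timestamps are passed explicitly to keep the demo deterministic):

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # burst size
        self.refill_rate = refill_rate  # sustained tokens per second
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(capacity=10, refill_rate=2.0)
burst = sum(tb.allow(0.0) for _ in range(12))
print(burst)  # 10 -- the full bucket absorbs the burst, then denies
# One second later, exactly 2 tokens have refilled:
refilled = [tb.allow(1.0) for _ in range(3)]
print(refilled)  # [True, True, False]
```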
How do you implement rate limiting across multiple servers?
Use a centralized store like Redis with atomic operations (INCR + EXPIRE) to track request counts. Lua scripts ensure atomicity. For ultra-low latency, use local counters synchronized periodically with the central store. The key insight is that all servers must share a single source of truth for each rate limit key. Redis serves this role well because it is single-threaded (no lock contention), supports atomic operations natively, and provides sub-millisecond response times. A Redis cluster can handle millions of rate limit checks per second.
What HTTP headers should a rate limiter return?
Return X-RateLimit-Limit (max requests), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (window reset time as a Unix timestamp), and Retry-After header with a 429 status code when the limit is exceeded. These headers are not part of an official HTTP standard but are a widely adopted convention. The Retry-After header tells the client exactly how many seconds to wait before retrying, which prevents clients from hammering the server with immediate retries. Well-behaved API clients use these headers to implement client-side throttling.
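A well-behaved client honors Retry-After instead of retrying immediately. A hypothetical client-side sketch (the dict response shape and `do_request` callable are illustrative):

```python
import time

def call_with_backoff(do_request, max_attempts=3, sleep=time.sleep):
    """Retry on 429, waiting the number of seconds the server asked for."""
    for _ in range(max_attempts):
        resp = do_request()
        if resp["status"] != 429:
            return resp
        sleep(int(resp["headers"].get("Retry-After", "1")))
    return resp

# Simulated server: rate-limited once, then succeeds.
responses = iter([
    {"status": 429, "headers": {"Retry-After": "2"}},
    {"status": 200, "headers": {}},
])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)  # 200 [2]
```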