Design a Rate Limiter: Algorithms and Implementation
Build a distributed rate limiter using token bucket, sliding window, and leaky bucket algorithms. Covers Redis-based implementation and API gateway integration.
This post applies concepts from the System Design from Zero to Hero series.
TL;DR
A distributed rate limiter uses algorithms like token bucket or sliding window with Redis as a shared counter store to enforce request limits per user, IP, or API key across multiple servers. The choice of algorithm determines the trade-off between memory usage, burst tolerance, and accuracy. In production, rate limiters are typically deployed as middleware in an API gateway, returning standard HTTP headers to communicate limits to clients.
Requirements
Functional Requirements
- Limit requests — Enforce a maximum number of requests per time window for a given key (user ID, IP address, or API key).
- Multiple rules — Support different rate limits for different API endpoints (e.g., 100 requests/minute for reads, 10 requests/minute for writes).
- Inform clients — Return standard HTTP headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`) on every response.
- Reject excess requests — Return HTTP 429 (Too Many Requests) with a `Retry-After` header when the limit is exceeded.
- Configurable rules — Rate limit rules should be configurable without code deployments.
Non-Functional Requirements
- Low latency — The rate limiting check must add less than 1 ms to each request.
- Distributed — Must work correctly across multiple API servers sharing the same limits.
- Highly available — If the rate limiter is unavailable, requests should be allowed through (fail-open) rather than blocking all traffic.
- Accurate — Minimize the window where requests can exceed the configured limit (race conditions).
- Memory efficient — Must handle millions of unique keys without excessive memory consumption.
Back-of-Envelope Estimation
Assume an API serving 100,000 requests per second across 50 API servers:
- Rate limit checks per second: 100,000 (one per request)
- Unique keys (users/IPs): 10 million active in any given hour
- Memory per key (token bucket): ~20 bytes (token count + last refill timestamp) → 10M * 20 = ~200 MB
- Memory per key (sliding window log): ~100 bytes per request timestamp → much higher, proportional to the limit
- Redis operations: 100,000 checks per second, each a single atomic command or script → at the edge of a single node's typical capacity (~100K ops/sec), so a small cluster provides headroom
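These estimates can be sanity-checked with a few lines (numbers taken from the text; Redis per-key overhead is ignored, so real memory use runs somewhat higher):

```python
unique_keys = 10_000_000

# Token bucket: ~20 bytes per key (token count + last-refill timestamp).
token_bucket_mb = unique_keys * 20 / 1e6
print(token_bucket_mb)  # 200.0 MB

# Sliding window log: ~100 bytes per stored timestamp. With a limit of
# 1,000 requests/hour, the worst case is 1,000 entries per key:
log_gb = unique_keys * 1000 * 100 / 1e9
print(log_gb)  # 1000.0 GB -- proportional to the limit, as noted
```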
High-Level Design
```
Client → Load Balancer → API Gateway (Rate Limiter Middleware)
                              ↓
                        Rate Limiter
                              ↓
                    Redis (Counter Store)
                              ↓
        [Allowed] → API Server → Response
        [Denied]  → 429 Too Many Requests
```
The rate limiter sits as middleware in the API gateway, before the request reaches the application server. It checks the current request count against the configured limit using Redis as a shared counter store. If the request is within the limit, it passes through. If not, it returns a 429 response immediately without touching the backend.
Detailed Design
Algorithm 1: Fixed Window Counter
The simplest algorithm. Divide time into fixed windows (e.g., each minute) and count requests in the current window.
```python
import time
import redis as redis_lib

redis = redis_lib.Redis()  # shared connection to the counter store

def fixed_window_check(key: str, limit: int, window_size: int) -> bool:
    """
    key: identifier (user_id, ip, api_key)
    limit: max requests per window
    window_size: window duration in seconds
    """
    current_window = int(time.time() // window_size)
    redis_key = f"ratelimit:{key}:{current_window}"
    count = redis.incr(redis_key)
    if count == 1:
        # First request in this window: start the expiry clock
        redis.expire(redis_key, window_size)
    return count <= limit
```

Pros: Simple to implement, low memory (one counter per key per window).
Cons: Boundary problem. A user can send limit requests at the end of one window and limit more at the start of the next window, effectively getting 2x the limit within a short burst. For example, with a limit of 100 requests/minute, a user could send 100 requests at 12:00:59 and 100 more at 12:01:01 — 200 requests in 2 seconds.
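The boundary problem is easy to reproduce with a tiny in-memory simulation (no Redis; the helper below is a stripped-down stand-in for `fixed_window_check`):

```python
def fixed_window_allow(counts, now, limit=100, window=60):
    """In-memory stand-in for the Redis fixed-window counter."""
    w = int(now // window)
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= limit

counts = {}
# 100 requests just before the window boundary all pass...
late = sum(fixed_window_allow(counts, 59.9) for _ in range(100))
# ...and 100 more just after it also pass, because the counter resets:
early = sum(fixed_window_allow(counts, 60.1) for _ in range(100))
print(late + early)  # 200 requests accepted in well under a second
```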
Algorithm 2: Sliding Window Log
Track the exact timestamp of every request. To check the limit, count timestamps within the last N seconds.
```python
def sliding_window_log_check(key: str, limit: int, window_size: int) -> bool:
    now = time.time()
    redis_key = f"ratelimit:{key}"
    # Use a Redis sorted set where score = timestamp
    pipeline = redis.pipeline()
    # Remove entries older than the window
    pipeline.zremrangebyscore(redis_key, 0, now - window_size)
    # Count entries in the current window
    pipeline.zcard(redis_key)
    # Add the current request (note: recorded even when denied, which
    # penalizes clients that keep retrying while over the limit)
    pipeline.zadd(redis_key, {str(now): now})
    # Set TTL for cleanup
    pipeline.expire(redis_key, window_size)
    results = pipeline.execute()
    count = results[1]
    return count < limit
```

Pros: Precise — no boundary problem. The window slides smoothly.
Cons: High memory usage. If the limit is 1,000 requests/hour, you store 1,000 timestamps per user. With 10 million users, that is 10 billion entries. This does not scale for high-limit scenarios.
Algorithm 3: Sliding Window Counter
A hybrid that approximates the sliding window using two fixed windows. It weighs the previous window's count by the fraction of time remaining.
```python
def sliding_window_counter_check(key: str, limit: int, window_size: int) -> bool:
    now = time.time()
    current_window = int(now // window_size)
    previous_window = current_window - 1
    # Position within the current window (0.0 to 1.0)
    elapsed_fraction = (now % window_size) / window_size
    current_key = f"ratelimit:{key}:{current_window}"
    previous_key = f"ratelimit:{key}:{previous_window}"
    current_count = int(redis.get(current_key) or 0)
    previous_count = int(redis.get(previous_key) or 0)
    # Weighted count: current window plus the fraction of the previous
    # window that still overlaps the sliding window
    weighted_count = current_count + previous_count * (1 - elapsed_fraction)
    if weighted_count >= limit:
        return False
    redis.incr(current_key)
    redis.expire(current_key, window_size * 2)
    return True
```

Example: With a limit of 100/minute, if the previous minute had 80 requests and we are 40% through the current minute (which has 30 requests so far), the weighted count is 30 + 80 * 0.6 = 78. Since 78 < 100, the request is allowed.
Pros: Low memory (two counters per key), smooths out the boundary problem.
Cons: Approximate — not perfectly accurate, but close enough for practical rate limiting.
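The weighted-count arithmetic from the example above, as a standalone calculation:

```python
def weighted_count(current_count, previous_count, elapsed_fraction):
    # Weight the previous window by the fraction of it still inside
    # the sliding window.
    return current_count + previous_count * (1 - elapsed_fraction)

# 40% into the current minute: 30 current requests, 80 in the previous minute.
print(weighted_count(30, 80, 0.4))  # 78.0 -> under the limit of 100, so allowed
```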
Algorithm 4: Token Bucket (Recommended)
The token bucket is the most widely used algorithm (used by AWS, Stripe, and most major API providers). A bucket holds tokens. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.
```python
def token_bucket_check(key: str, max_tokens: int, refill_rate: float) -> bool:
    """
    max_tokens: bucket capacity (allows bursts up to this amount)
    refill_rate: tokens added per second
    """
    redis_key = f"ratelimit:{key}"
    now = time.time()
    # Lua script for atomic check-and-update
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or max_tokens
    local last_refill = tonumber(data[2]) or now

    -- Calculate tokens to add since last refill
    local elapsed = now - last_refill
    tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    -- Expire idle buckets after twice the time needed to refill from empty
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
    return allowed
    """
    result = redis.eval(lua_script, 1, redis_key, max_tokens, refill_rate, now)
    return result == 1
```

Pros: Allows controlled bursts (up to `max_tokens`), smooth rate enforcement, low memory (two values per key), easy to reason about.
Cons: Slightly more complex to implement than fixed window. The burst behavior may not be desirable for some use cases.
Why Lua scripts? The check-and-update operation (read tokens, compute refill, deduct, write back) must be atomic. Without atomicity, two concurrent requests could both read the same token count and both succeed when only one should. Redis Lua scripts execute atomically — no other command can interleave during script execution. For more on Redis as a distributed coordination tool, see Part 5: Caching Strategies.
Rate Limiter as Middleware
The rate limiter is deployed as middleware in the API gateway, not within each application server. This centralizes rate limiting logic and ensures it applies consistently regardless of which backend service handles the request.
Request → TLS Termination → Authentication → Rate Limiting → Routing → Backend
The middleware adds standard HTTP headers to every response:
```
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 67
X-RateLimit-Reset: 1691234567
```

Or, when the limit is exceeded:

```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1691234567
Retry-After: 23
```
For API design best practices including rate limit headers, see Part 8: API Design.
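A minimal, framework-agnostic sketch of the middleware step. It assumes requests and responses are plain dicts, and uses a toy in-memory check with an injectable clock in place of the Redis-backed one (`make_fixed_window_check` and the dict shapes are illustrative, not any gateway's API):

```python
import time

def make_fixed_window_check(limit, window, clock=time.time):
    """Toy in-memory limiter; a real deployment would run this check in Redis."""
    counts = {}
    def check(key):
        w = int(clock() // window)
        counts[(key, w)] = counts.get((key, w), 0) + 1
        used = counts[(key, w)]
        reset = (w + 1) * window  # Unix timestamp when the window rolls over
        return used <= limit, max(0, limit - used), int(reset)
    return check

def rate_limit_middleware(handler, check, limit, clock=time.time):
    """Run the limit check before the handler and attach standard headers."""
    def wrapped(request):
        allowed, remaining, reset = check(request["user_id"])
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(remaining),
            "X-RateLimit-Reset": str(reset),
        }
        if not allowed:
            headers["Retry-After"] = str(max(0, reset - int(clock())))
            return {"status": 429, "headers": headers}
        response = handler(request)
        response["headers"] = {**response.get("headers", {}), **headers}
        return response
    return wrapped

clock = lambda: 1_000.0  # fixed clock keeps the demo deterministic
api = rate_limit_middleware(lambda req: {"status": 200},
                            make_fixed_window_check(2, 60, clock),
                            limit=2, clock=clock)
statuses = [api({"user_id": "u1"})["status"] for _ in range(3)]
print(statuses)  # [200, 200, 429]
```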
Per-User vs Per-IP vs Per-API-Key
Different rate limiting keys serve different purposes:
Per-User (authenticated requests):
Rate limit by user_id. This is the most common approach for authenticated APIs. Each user gets their own quota regardless of how many devices or IPs they use.
Per-IP (unauthenticated requests):
Rate limit by client IP address. Useful for login endpoints (brute-force protection) and public APIs. Caveat: users behind a NAT or corporate proxy share an IP. Use the X-Forwarded-For header carefully (it can be spoofed).
Per-API-Key: Rate limit by API key. Common for developer APIs (Stripe, Twilio) where different pricing tiers get different limits. A free tier might get 100 requests/minute while an enterprise tier gets 10,000.
Multi-dimensional rate limiting: Apply multiple rules simultaneously. For example:
- 100 requests/minute per user (protects individual quota)
- 1,000 requests/minute per IP (protects against shared abuse)
- 10 requests/second per user per endpoint (protects expensive endpoints)
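A sketch of AND-ing several rules, using in-memory counters and an explicit timestamp for clarity (in production each check would run against Redis, ideally inside one Lua script):

```python
def check_all_rules(request, rules, counters, now):
    """Allow only if every rule passes; a denied request consumes no quota.
    Each rule is (name, key_fn, limit, window_seconds)."""
    keys = []
    for name, key_fn, limit, window in rules:
        k = (name, key_fn(request), int(now // window))
        if counters.get(k, 0) + 1 > limit:
            return False  # deny before committing any increments
        keys.append(k)
    for k in keys:
        counters[k] = counters.get(k, 0) + 1
    return True

rules = [
    ("per_user", lambda r: r["user_id"], 100, 60),
    ("per_ip", lambda r: r["ip"], 1000, 60),
    ("per_user_endpoint", lambda r: (r["user_id"], r["path"]), 10, 1),
]
counters = {}
req = {"user_id": "u1", "ip": "203.0.113.7", "path": "/api/v1/search"}
results = [check_all_rules(req, rules, counters, now=1000.0) for _ in range(11)]
print(results.count(True))  # 10 -- the strictest rule (10/sec/endpoint) binds
```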
Handling Race Conditions in Distributed Rate Limiting
The core race condition: two requests from the same user arrive at different API servers simultaneously. Both read the counter as 99 (limit is 100), both increment to 100, but the actual count should be 101 (one should be rejected).
Solution 1: Redis Lua Scripts (recommended)
Lua scripts execute atomically on the Redis server. The entire check-and-increment happens as a single operation. No other Redis command can interleave.
Solution 2: Redis MULTI/EXEC Transactions
```
MULTI
INCR ratelimit:user123:current_window
EXPIRE ratelimit:user123:current_window 60
EXEC
```
MULTI/EXEC executes the queued commands as an atomic batch, but it cannot branch on intermediate results — the INCR always succeeds, and you only inspect the count after execution. This works for simple counter-based algorithms but not for token bucket, where the read-compute-write cycle must itself be atomic.
Solution 3: Local Rate Limiting with Sync
For ultra-low latency, each API server maintains a local counter and periodically syncs with Redis. This trades accuracy for speed: during the sync interval, the system may allow slightly more requests than the limit. This approach works when approximate rate limiting is acceptable (e.g., allowing 105 requests when the limit is 100).
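A sketch of the local-counter approach under stated assumptions: `fetch_global` stands in for reading the cluster-wide count from Redis, and pushing local increments back to the store is omitted:

```python
class LocalRateLimiter:
    """Serve checks from a local counter, refreshing the shared count
    from the central store every `sync_interval` seconds."""
    def __init__(self, limit, sync_interval, fetch_global):
        self.limit = limit
        self.sync_interval = sync_interval
        self.fetch_global = fetch_global  # e.g. a Redis read behind the scenes
        self.global_base = 0
        self.local_count = 0
        self.last_sync = float("-inf")

    def allow(self, now):
        if now - self.last_sync >= self.sync_interval:
            # Refresh from the shared store; a full implementation would
            # also push the local increments back here.
            self.global_base = self.fetch_global()
            self.local_count = 0
            self.last_sync = now
        if self.global_base + self.local_count >= self.limit:
            return False
        self.local_count += 1
        return True

# Other servers have already used 3 of the 5-per-window budget:
rl = LocalRateLimiter(limit=5, sync_interval=1.0, fetch_global=lambda: 3)
outcomes = [rl.allow(10.0) for _ in range(3)]
print(outcomes)  # [True, True, False]
```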
Data Model
Rate Limit Rules (Configuration Store)
| Column | Type | Description |
|---|---|---|
| rule_id | VARCHAR | Primary key |
| key_type | VARCHAR | user / ip / api_key |
| endpoint | VARCHAR | API endpoint pattern (e.g., /api/v1/*) |
| algorithm | VARCHAR | token_bucket / sliding_window / fixed |
| max_requests | INT | Maximum requests per window |
| window_seconds | INT | Time window duration |
| burst_size | INT | Max burst (token bucket capacity) |
| tier | VARCHAR | free / pro / enterprise |
Redis Data Structures Per Algorithm
Fixed Window:
```
Key:  ratelimit:{key}:{window_id}
Type: String (counter)
TTL:  window_size seconds
```
Sliding Window Log:
```
Key:  ratelimit:{key}
Type: Sorted Set (score = timestamp, member = request_id)
TTL:  window_size seconds
```
Token Bucket:
```
Key:  ratelimit:{key}
Type: Hash {tokens: float, last_refill: float}
TTL:  2 * (max_tokens / refill_rate) seconds
```
Scaling Considerations
Redis clustering: For millions of unique rate limit keys, use a Redis cluster. Shard by the rate limit key so all requests for a given user hit the same Redis node. This is critical for atomicity — if requests for the same key go to different nodes, the counter is split and the limit is not enforced correctly.
Redis failure handling (fail-open vs fail-closed):
- Fail-open: If Redis is unreachable, allow all requests through. This prevents a Redis outage from becoming a total API outage. The risk is that rate limits are not enforced during the outage.
- Fail-closed: If Redis is unreachable, reject all requests. This is safer for security-sensitive endpoints (login brute force protection) but can cause cascading failures.
- Recommendation: Fail-open for general API rate limiting, fail-closed for security-critical endpoints, with a local in-memory fallback rate limiter that activates during Redis outages.
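The recommendation fits in a few lines. This sketch assumes `redis_check` raises the built-in `ConnectionError` when the store is unreachable and `local_check` is an in-process fallback limiter (both names are illustrative):

```python
def resilient_check(key, redis_check, local_check, fail_open=True):
    """Fail open (with a local fallback) when the central store is down."""
    try:
        return redis_check(key)
    except ConnectionError:
        if not fail_open:
            return False  # fail-closed: safer for security-sensitive endpoints
        return local_check(key)  # approximate limiting during the outage

def redis_down(key):
    raise ConnectionError("Redis unreachable")

print(resilient_check("user:42", redis_down, lambda k: True))         # True
print(resilient_check("user:42", redis_down, lambda k: True, False))  # False
```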
Multiple data centers: If your API servers span multiple data centers, each with its own Redis, the rate limit is effectively split. A user with a limit of 100/minute could send 100 to each data center. Solutions: use a single global Redis cluster with cross-region replication (adds latency), or accept the approximation and set each region's limit to total_limit / num_regions.
Hot keys: A single user making many requests creates a hot key on one Redis shard. Redis handles this well up to ~100K ops/sec per key. Beyond that, use local caching with periodic sync.
Trade-offs and Alternatives
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Algorithm | Token bucket | Sliding window counter | Token bucket for burst control, sliding window for strict limits |
| Storage | Redis | Local memory | Redis for distributed, local for single-server |
| Failure mode | Fail-open | Fail-closed | Fail-open for availability, fail-closed for security |
| Atomicity | Lua scripts | MULTI/EXEC | Lua scripts for complex algorithms |
| Placement | API gateway | Application code | API gateway for centralized enforcement |
Why not rate limit at the application layer? Application-layer rate limiting works for single-server deployments, but in a distributed system, each server only sees a fraction of the traffic. A user sending 100 requests spread across 10 servers would see a limit of 10 on each server but use 100 total. Centralized rate limiting via Redis ensures a single view of usage.
Why not use a leaky bucket? The leaky bucket algorithm processes requests at a fixed rate, queuing excess requests. While it provides smooth output, it adds latency to queued requests (they must wait). Token bucket is preferred because it allows bursts while still enforcing average rates — requests are either allowed instantly or rejected, with no queuing delay.
Commercial alternatives: Cloud providers offer built-in rate limiting (AWS API Gateway, Cloudflare Rate Limiting, Kong). These are operationally simpler but less customizable. For most applications, a managed solution is sufficient. Build your own only when you need custom multi-dimensional rules, per-user tiering, or integration with your specific billing system.
FAQ
Which rate limiting algorithm is best for API protection?
Token bucket is the most popular because it allows short bursts while enforcing average rates. It is memory-efficient, easy to implement with Redis, and used by AWS, Stripe, and most major API providers. The two parameters (bucket capacity and refill rate) map intuitively to business requirements: capacity controls burst size, and refill rate controls sustained throughput. For example, a capacity of 10 and a refill rate of 2/second allows a burst of 10 requests followed by a sustained rate of 2 requests/second.
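The capacity-10, refill-2/second example can be checked with a minimal in-memory token bucket (timestamps are passed explicitly to keep the demo deterministic):

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # burst size
        self.refill_rate = refill_rate  # sustained tokens per second
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(capacity=10, refill_rate=2.0)
burst = sum(tb.allow(0.0) for _ in range(12))
print(burst)  # 10 -- the full bucket absorbs the burst, then denies
# One second later, exactly 2 tokens have refilled:
refilled = [tb.allow(1.0) for _ in range(3)]
print(refilled)  # [True, True, False]
```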
How do you implement rate limiting across multiple servers?
Use a centralized store like Redis with atomic operations (INCR + EXPIRE) to track request counts. Lua scripts ensure atomicity. For ultra-low latency, use local counters synchronized periodically with the central store. The key insight is that all servers must share a single source of truth for each rate limit key. Redis serves this role well because it is single-threaded (no lock contention), supports atomic operations natively, and provides sub-millisecond response times. A Redis cluster can handle millions of rate limit checks per second.
What HTTP headers should a rate limiter return?
Return X-RateLimit-Limit (max requests), X-RateLimit-Remaining (requests left), X-RateLimit-Reset (window reset time as a Unix timestamp), and Retry-After header with a 429 status code when the limit is exceeded. These headers are not part of an official HTTP standard but are a widely adopted convention. The Retry-After header tells the client exactly how many seconds to wait before retrying, which prevents clients from hammering the server with immediate retries. Well-behaved API clients use these headers to implement client-side throttling.
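A well-behaved client honors Retry-After instead of retrying immediately. A hypothetical client-side sketch (the dict response shape and `do_request` callable are illustrative):

```python
import time

def call_with_backoff(do_request, max_attempts=3, sleep=time.sleep):
    """Retry on 429, waiting the number of seconds the server asked for."""
    for _ in range(max_attempts):
        resp = do_request()
        if resp["status"] != 429:
            return resp
        sleep(int(resp["headers"].get("Retry-After", "1")))
    return resp

# Simulated server: rate-limited once, then succeeds.
responses = iter([
    {"status": 429, "headers": {"Retry-After": "2"}},
    {"status": 200, "headers": {}},
])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result["status"], waits)  # 200 [2]
```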