API Design, Rate Limiting, and Authentication
Design robust APIs with proper rate limiting, OAuth2 authentication, and versioning. Learn RESTful best practices and API gateway patterns for production systems.
This is Part 8 of the System Design from Zero to Hero series.
TL;DR
Well-designed APIs use consistent naming, proper HTTP semantics, versioning, rate limiting to prevent abuse, and token-based authentication to secure every endpoint. This post covers REST best practices, GraphQL trade-offs, rate limiting algorithms, OAuth2 flows, and API gateway patterns that protect your distributed system at the edge.
Why This Matters
You can have perfectly sharded databases (Part 7), robust caching (Part 5), and resilient message queues (Part 6) -- but if your API is poorly designed, clients will misuse it, attackers will abuse it, and your team will struggle to evolve it.
APIs are contracts. Once published, they are extremely difficult to change without breaking clients. The decisions you make at the API layer ripple through every consumer for years. Rate limiting and authentication are not afterthoughts -- they are the first line of defense for every production system.
Core Concepts
REST API Best Practices
REST is not a protocol -- it is an architectural style built on top of HTTP. Here are the conventions that make REST APIs predictable and easy to consume.
Resource Naming:
# Good - nouns, plural, hierarchical
GET /api/v1/users
GET /api/v1/users/123
GET /api/v1/users/123/orders
POST /api/v1/users/123/orders
# Bad - verbs in URLs, inconsistent pluralization
GET /api/v1/getUser/123
POST /api/v1/createOrder
GET /api/v1/user/123/order
HTTP Methods:
| Method | Purpose | Idempotent | Safe |
|---|---|---|---|
| GET | Read a resource | Yes | Yes |
| POST | Create a resource | No | No |
| PUT | Replace a resource entirely | Yes | No |
| PATCH | Partially update a resource | No | No |
| DELETE | Remove a resource | Yes | No |
Idempotency matters because network failures cause retries. A PUT request with the same payload should produce the same result regardless of how many times it is sent. POST is not idempotent -- sending the same "create order" request twice creates two orders unless you implement idempotency keys.
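A minimal sketch of idempotency keys with FastAPI and Redis (process_order is a hypothetical order-creation helper): the client sends a unique Idempotency-Key header, and retries replay the stored response instead of creating a second order.
import json
from fastapi import FastAPI, Header
import redis

app = FastAPI()
redis_client = redis.Redis()

@app.post("/api/v1/orders")
def create_order(order: dict, idempotency_key: str = Header(...)):
    key = f"idempotency:{idempotency_key}"
    cached = redis_client.get(key)
    if cached:
        # Retry of an already-processed request: return the original result
        return json.loads(cached)
    result = process_order(order)  # hypothetical order-creation helper
    # Remember the response so retries of this key are harmless (24h TTL);
    # a production version would reserve the key atomically (SET NX) first
    redis_client.set(key, json.dumps(result), ex=86400)
    return result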
Status Codes:
200 OK - Successful read or update
201 Created - Resource created (include Location header)
204 No Content - Successful delete
400 Bad Request - Client sent invalid data
401 Unauthorized - Missing or invalid authentication
403 Forbidden - Authenticated but not authorized
404 Not Found - Resource does not exist
409 Conflict - Resource state conflict (duplicate, version mismatch)
429 Too Many Requests - Rate limit exceeded (include Retry-After header)
500 Internal Server Error - Server-side failure
503 Service Unavailable - Temporarily overloaded (include Retry-After)
Pagination:
# Cursor-based pagination (preferred for large datasets)
# GET /api/v1/orders?cursor=eyJpZCI6MTAwfQ&limit=20
import base64
import json

# Assumes a SQLAlchemy session (db) and an Order model with an integer primary key
@app.get("/api/v1/orders")
def list_orders(cursor: str = None, limit: int = 20):
    query = db.query(Order).order_by(Order.id)  # Stable ordering is required for cursors
    if cursor:
        decoded = base64.b64decode(cursor)
        last_id = json.loads(decoded)["id"]
        query = query.filter(Order.id > last_id)
    orders = query.limit(limit + 1).all()  # Fetch one extra row to detect a next page
    has_next = len(orders) > limit
    orders = orders[:limit]
    next_cursor = None
    if has_next:
        next_cursor = base64.b64encode(
            json.dumps({"id": orders[-1].id}).encode()
        ).decode()
    return {
        "data": orders,
        "pagination": {
            "next_cursor": next_cursor,
            "has_next": has_next
        }
    }
Cursor-based pagination avoids the offset problem, where OFFSET 1000000 forces the database to scan and skip a million rows. As we discussed in Part 4, query performance degrades with large offsets.
GraphQL: When REST Falls Short
GraphQL solves specific problems that REST handles poorly:
Over-fetching: A mobile client needs just name and avatar, but the REST endpoint returns 30 fields.
Under-fetching: Displaying a user profile requires 3 separate REST calls (user, posts, followers).
Rapid iteration: Frontend teams can change their queries without waiting for backend API changes.
# Single GraphQL query replaces multiple REST calls
query UserProfile($id: ID!) {
  user(id: $id) {
    name
    avatar
    posts(last: 5) {
      title
      createdAt
    }
    followers {
      totalCount
    }
  }
}
GraphQL trade-offs:
- Caching is harder (no URL-based HTTP caching; you need persisted queries or APQ)
- N+1 query problems require the DataLoader pattern (see the sketch after this list)
- Rate limiting is complex (one query can trigger thousands of database operations)
- File uploads require workarounds
- Learning curve for teams familiar with REST
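A minimal sketch of the batching idea behind DataLoader, assuming an async fetch_users_by_ids(ids)-style bulk-query helper; production code would typically reach for a library like aiodataloader.
import asyncio

class UserLoader:
    """Batches individual load() calls made during the same event-loop tick
    into one bulk query instead of N separate ones."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # async fn: list of ids -> list of user dicts
        self.pending = []         # queued (user_id, future) pairs
        self.scheduled = False

    def load(self, user_id):
        loop = asyncio.get_event_loop()
        future = loop.create_future()
        self.pending.append((user_id, future))
        if not self.scheduled:
            self.scheduled = True
            # Flush once the current tick's resolvers have all queued their ids
            loop.call_soon(lambda: asyncio.ensure_future(self._flush()))
        return future  # awaitable

    async def _flush(self):
        batch, self.pending = self.pending, []
        self.scheduled = False
        users = await self.batch_fn([uid for uid, _ in batch])  # one IN (...) query
        by_id = {u["id"]: u for u in users}
        for uid, future in batch:
            future.set_result(by_id.get(uid))
Each resolver awaits loader.load(id); the loader flushes once per tick, so resolving the author of 100 posts issues one query instead of 100.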
Use GraphQL when you have multiple client types (web, mobile, third-party) with different data needs. Stick with REST for simple CRUD APIs, public APIs, or when HTTP caching is critical.
API Gateway Pattern
An API gateway sits between clients and your backend services, handling cross-cutting concerns:
Client --> API Gateway --> Service A
                       --> Service B
                       --> Service C
Gateway responsibilities:
- Authentication and authorization
- Rate limiting
- Request routing
- Response aggregation
- SSL termination
- Request/response transformation
- Logging and monitoring
This relates directly to the load balancing concepts from Part 3. The API gateway often sits behind a load balancer and routes requests to the appropriate microservice.
# Kong API Gateway configuration example
services:
  - name: user-service
    url: http://user-service:8080
    routes:
      - name: user-routes
        paths:
          - /api/v1/users
    plugins:
      - name: rate-limiting
        config:
          minute: 100
          policy: redis
          redis_host: redis
      - name: jwt
        config:
          secret_is_base64: false
      - name: cors
        config:
          origins: ["https://myapp.com"]
          methods: ["GET", "POST", "PUT", "DELETE"]
Rate Limiting Algorithms
Rate limiting protects your system from abuse, ensures fair usage, and prevents cascading failures. There are four main algorithms:
1. Fixed Window Counter
Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window.
import redis
import time
redis_client = redis.Redis()
def fixed_window_rate_limit(client_id: str, limit: int, window_seconds: int) -> bool:
    window_key = f"ratelimit:{client_id}:{int(time.time()) // window_seconds}"
    current = redis_client.incr(window_key)
    if current == 1:
        redis_client.expire(window_key, window_seconds)
    return current <= limit
Problem: burst at window boundaries. A client can send 100 requests at 11:59:59 and 100 more at 12:00:00, effectively getting 200 requests in 2 seconds.
2. Sliding Window Log
Store timestamps of all requests. Count requests within the sliding window.
def sliding_window_log(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    key = f"ratelimit:log:{client_id}"
    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # Remove old entries
    pipe.zadd(key, {str(now): now})  # Add current request
    pipe.zcard(key)  # Count entries
    pipe.expire(key, window_seconds)
    results = pipe.execute()
    return results[2] <= limit
Precise but memory-intensive. Every request timestamp is stored, which is impractical at high volumes.
3. Sliding Window Counter
Combines fixed window efficiency with sliding window accuracy by weighting the previous window:
def sliding_window_counter(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    # How far we are into the current window (0.0 to 1.0)
    elapsed_ratio = (now % window_seconds) / window_seconds
    current_key = f"ratelimit:{client_id}:{current_window}"
    previous_key = f"ratelimit:{client_id}:{previous_window}"
    current_count = int(redis_client.get(current_key) or 0)
    previous_count = int(redis_client.get(previous_key) or 0)
    # Weighted count: full current + proportional previous
    weighted_count = current_count + previous_count * (1 - elapsed_ratio)
    if weighted_count >= limit:
        return False
    redis_client.incr(current_key)
    redis_client.expire(current_key, window_seconds * 2)
    return True
This is a practical choice for most production systems -- low memory, good accuracy.
4. Token Bucket
A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.
def token_bucket(client_id: str, capacity: int, refill_rate: float) -> bool:
    """
    capacity: max burst size
    refill_rate: tokens added per second
    """
    key = f"ratelimit:bucket:{client_id}"
    now = time.time()
    bucket = redis_client.hmget(key, "tokens", "last_refill")
    tokens = float(bucket[0]) if bucket[0] else capacity
    last_refill = float(bucket[1]) if bucket[1] else now
    # Add tokens based on elapsed time
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * refill_rate)
    if tokens < 1:
        return False
    tokens -= 1
    redis_client.hset(key, mapping={"tokens": tokens, "last_refill": now})
    redis_client.expire(key, int(capacity / refill_rate) + 1)
    return True
Token bucket is the most widely used algorithm in production because it naturally handles burst traffic while enforcing a long-term average rate. AWS API Gateway, Stripe, and GitHub all use variations of token bucket.
OAuth2 Flows
OAuth2 is the standard for API authentication and authorization. The flow you choose depends on the client type:
Authorization Code Flow (web apps with a backend):
1. Client redirects user to authorization server
2. User authenticates and consents
3. Authorization server redirects back with an authorization code
4. Client exchanges code for access token (server-to-server, code is short-lived)
5. Client uses access token to call APIs
Authorization Code with PKCE (mobile/SPA): Same as above, but the client generates a code verifier and challenge to prevent authorization code interception. This is the recommended flow for all public clients.
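A minimal sketch of the PKCE pieces (the endpoints and client_id are placeholders); the verifier never leaves the client until the token exchange:
import base64
import hashlib
import secrets

# Client generates a one-time verifier and derives the S256 challenge from it
code_verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
code_challenge = base64.urlsafe_b64encode(
    hashlib.sha256(code_verifier.encode()).digest()
).rstrip(b"=").decode()

# Step 1: the challenge travels with the authorization request
auth_url = (
    "https://auth.example.com/oauth/authorize"
    "?response_type=code&client_id=my-spa"
    "&redirect_uri=https://myapp.com/callback"
    f"&code_challenge={code_challenge}&code_challenge_method=S256"
)

# Step 4: exchange the returned code, proving possession of the verifier
# token_response = requests.post("https://auth.example.com/oauth/token", data={
#     "grant_type": "authorization_code",
#     "code": auth_code,  # from the redirect
#     "code_verifier": code_verifier,
#     "client_id": "my-spa",
#     "redirect_uri": "https://myapp.com/callback",
# })
An attacker who intercepts the authorization code cannot redeem it without the verifier, which only the legitimate client holds.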
Client Credentials (service-to-service):
# Service-to-service authentication
import requests
response = requests.post("https://auth.example.com/oauth/token", data={
    "grant_type": "client_credentials",
    "client_id": "service-orders",
    "client_secret": "secret",
    "scope": "read:users"
})
access_token = response.json()["access_token"]

# Use the token
headers = {"Authorization": f"Bearer {access_token}"}
users = requests.get("https://api.example.com/users", headers=headers)
API Key Management
API keys are simpler than OAuth2 but less secure. They are appropriate for server-to-server calls where the key can be kept secret.
Best practices:
- Generate keys with sufficient entropy (at least 32 random bytes, base64-encoded)
- Hash keys before storing them (treat them like passwords; see the sketch below)
- Support key rotation with overlapping validity periods
- Scope keys to specific permissions and rate limits
- Include a prefix for easy identification (sk_live_, pk_test_)
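A minimal sketch of generation and verification under those practices (the sk_live_ prefix is illustrative; unlike low-entropy passwords, random keys can safely use a fast hash like SHA-256):
import hashlib
import secrets

def generate_api_key(prefix: str = "sk_live_") -> tuple[str, str]:
    """Returns (plaintext_key, key_hash). Show the plaintext once; store only the hash."""
    key = prefix + secrets.token_urlsafe(32)  # ~43 characters of entropy
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    return key, key_hash

def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    candidate = hashlib.sha256(presented_key.encode()).hexdigest()
    # Constant-time comparison avoids timing side channels
    return secrets.compare_digest(candidate, stored_hash)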
API Versioning Strategies
URL path versioning: /api/v1/users -- Simple, explicit, easy to route. Most common approach.
Header versioning: Accept: application/vnd.myapi.v2+json -- Cleaner URLs but harder to test in a browser.
Query parameter: /api/users?version=2 -- Easy to implement but pollutes the URL.
No versioning (evolve in place): Add fields, never remove or rename. Use feature flags. Works for internal APIs.
The pragmatic choice for most teams is URL path versioning. It is explicit, cacheable, and easy to route at the API gateway level.
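With FastAPI, for instance, path versioning falls out of router prefixes; a minimal sketch where v2 makes a breaking response change without touching v1:
from fastapi import FastAPI, APIRouter

app = FastAPI()
v1 = APIRouter(prefix="/api/v1")
v2 = APIRouter(prefix="/api/v2")

@v1.get("/users/{user_id}")
def get_user_v1(user_id: int):
    return {"id": user_id, "name": "Ada Lovelace"}  # original shape

@v2.get("/users/{user_id}")
def get_user_v2(user_id: int):
    # v2 splits name into given/family -- a breaking change isolated to /api/v2
    return {"id": user_id, "given_name": "Ada", "family_name": "Lovelace"}

app.include_router(v1)
app.include_router(v2)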
Practical Implementation
Here is a complete rate-limited API endpoint using FastAPI:
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt
import redis
import time
app = FastAPI()
security = HTTPBearer()
redis_client = redis.Redis(host='localhost', port=6379)
SECRET_KEY = "your-secret-key"
def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    try:
        payload = jwt.decode(credentials.credentials, SECRET_KEY, algorithms=["HS256"])
        return payload
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

def rate_limit(request: Request, user=Depends(verify_token)):
    client_id = user.get("sub", request.client.host)
    limit = user.get("rate_limit", 100)  # Per-user configurable limits
    if not sliding_window_counter(client_id, limit, window_seconds=60):
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
            headers={"Retry-After": "60"}
        )
    return user

@app.get("/api/v1/orders")
def list_orders(user=Depends(rate_limit)):
    # Authenticated, rate-limited endpoint; get_user_orders is a placeholder
    # for your data-access layer
    return {"orders": get_user_orders(user["sub"])}
Trade-offs and Decision Framework
| Concern | Simple Approach | Production Approach |
|---|---|---|
| Auth | API keys | OAuth2 with PKCE + JWTs |
| Rate Limiting | Fixed window per IP | Token bucket per user + per IP with Redis |
| Versioning | URL path | URL path + sunset headers |
| Pagination | Offset-based | Cursor-based |
| Error Format | Status code only | RFC 7807 Problem Details |
| Documentation | Manual | OpenAPI/Swagger auto-generated |
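For reference, RFC 7807 standardizes a machine-readable error body (application/problem+json). A minimal sketch that renders every HTTPException from the endpoint above as a Problem Details response; the type URL is a placeholder:
from fastapi.responses import JSONResponse

@app.exception_handler(HTTPException)
def problem_details_handler(request: Request, exc: HTTPException):
    # RFC 7807 fields: type (URL identifying the error class), title,
    # status, detail, and instance (the specific occurrence)
    return JSONResponse(
        status_code=exc.status_code,
        content={
            "type": "https://api.example.com/errors/generic",  # placeholder
            "title": "Request failed",
            "status": exc.status_code,
            "detail": str(exc.detail),
            "instance": request.url.path,
        },
        headers=exc.headers,
        media_type="application/problem+json",
    )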
Common Interview Questions
Q: How would you design rate limiting for a global API serving millions of requests per second?
A: Use a token bucket algorithm backed by Redis. Deploy Redis clusters in each region for low latency. Use a local in-memory counter as a first pass (to reduce Redis calls) with periodic synchronization. Set different limits per tier (free, pro, enterprise). Return 429 with Retry-After header. For distributed rate limiting across regions, accept slight over-limit (eventually consistent counters) or use a centralized counter with higher latency.
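A minimal sketch of that local-first idea, assuming a small per-second in-process allowance before falling back to a shared Redis check (e.g. the token_bucket function above); it trades exact global counts for fewer Redis round-trips:
import threading
import time

class TwoTierLimiter:
    """First pass: cheap in-process counter. Only when the local budget for
    the current second is spent do we consult the shared Redis limiter."""

    def __init__(self, local_budget: int, redis_check):
        self.local_budget = local_budget
        self.redis_check = redis_check  # fn(client_id) -> bool, e.g. token_bucket
        self.lock = threading.Lock()
        self.counts = {}  # client_id -> (window_start, count)

    def allow(self, client_id: str) -> bool:
        now = int(time.time())
        with self.lock:
            window_start, count = self.counts.get(client_id, (now, 0))
            if window_start != now:
                window_start, count = now, 0  # new 1-second local window
            if count < self.local_budget:
                self.counts[client_id] = (window_start, count + 1)
                return True
        # Local budget exhausted -- fall back to the shared, accurate limiter
        return self.redis_check(client_id)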
Q: A client needs data from 5 microservices to render one page. How do you optimize this?
A: Use an API gateway to aggregate responses server-side (Backend for Frontend pattern). Alternatively, use GraphQL to let the client specify exactly what it needs in one request. For frequently accessed combinations, create a dedicated composite endpoint. Ensure downstream services are cached as described in Part 5.
Q: How do you handle API versioning when you need to make a breaking change?
A: Deploy the new version alongside the old one. Communicate the deprecation timeline (minimum 6-12 months for public APIs). Add Sunset and Deprecation headers to old version responses. Monitor usage of the old version. Provide a migration guide. Only decommission after confirming zero or negligible traffic.
Q: How do you prevent API abuse beyond rate limiting?
A: Layer multiple defenses: rate limiting per user and per IP, request size limits, input validation, authentication on all endpoints, CORS configuration, WAF rules for common attack patterns, anomaly detection on usage patterns, and IP reputation scoring.
What's Next
With APIs secured and rate-limited, Part 9: CAP Theorem and Distributed Consensus dives into the fundamental trade-offs that govern how distributed systems maintain consistency when things go wrong.
FAQ
What are the most common rate limiting algorithms?
The four covered in this post are the fixed window counter (simple), sliding window log (precise), sliding window counter (balanced), and token bucket (burst-friendly); leaky bucket (smooth output rate) is a close relative of token bucket. Token bucket is the most widely used in production.
Should I use REST or GraphQL for my API?
Use REST for simple CRUD operations with well-defined resources. Use GraphQL when clients need flexible queries, you want to avoid over-fetching, or your frontend teams need to iterate without backend changes.
How do I version my API without breaking existing clients?
Use URL path versioning (/v1/users) for simplicity or header-based versioning for cleaner URLs. Always maintain backward compatibility within a version and give clients a deprecation timeline for older versions.