API Design, Rate Limiting, and Authentication

October 15, 2025 · 7 min read

Design robust APIs with proper rate limiting, OAuth2 authentication, and versioning. Learn RESTful best practices and API gateway patterns for production systems.

Tags: System Design, API Design, Rate Limiting, OAuth2, Security
This is Part 8 of the System Design from Zero to Hero series.

TL;DR

Well-designed APIs use consistent naming, proper HTTP semantics, versioning, rate limiting to prevent abuse, and token-based authentication to secure every endpoint. This post covers REST best practices, GraphQL trade-offs, rate limiting algorithms, OAuth2 flows, and API gateway patterns that protect your distributed system at the edge.

Why This Matters

You can have perfectly sharded databases (Part 7), robust caching (Part 5), and resilient message queues (Part 6) -- but if your API is poorly designed, clients will misuse it, attackers will abuse it, and your team will struggle to evolve it.

APIs are contracts. Once published, they are extremely difficult to change without breaking clients. The decisions you make at the API layer ripple through every consumer for years. Rate limiting and authentication are not afterthoughts -- they are the first line of defense for every production system.

Core Concepts

REST API Best Practices

REST is not a protocol -- it is an architectural style built on top of HTTP. Here are the conventions that make REST APIs predictable and easy to consume.

Resource Naming:

# Good - nouns, plural, hierarchical
GET    /api/v1/users
GET    /api/v1/users/123
GET    /api/v1/users/123/orders
POST   /api/v1/users/123/orders

# Bad - verbs in URLs, inconsistent pluralization
GET    /api/v1/getUser/123
POST   /api/v1/createOrder
GET    /api/v1/user/123/order

HTTP Methods:

Method   Purpose                       Idempotent   Safe
------   ---------------------------   ----------   ----
GET      Read a resource               Yes          Yes
POST     Create a resource             No           No
PUT      Replace a resource entirely   Yes          No
PATCH    Partially update a resource   No           No
DELETE   Remove a resource             Yes          No

Idempotency matters because network failures cause retries. A PUT request with the same payload should produce the same result regardless of how many times it is sent. POST is not idempotent -- sending the same "create order" request twice creates two orders unless you implement idempotency keys.
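The idempotency-key technique mentioned above can be sketched as follows. This is a minimal in-memory version with illustrative names; a production system would store keys in Redis or a database with a TTL:

```python
# Minimal sketch of idempotency keys for POST. The store and payload shape
# are illustrative assumptions; production systems persist keys with a TTL.
_processed: dict[str, dict] = {}

def create_order(idempotency_key: str, payload: dict) -> dict:
    # If we have already seen this key, return the original result
    # instead of creating a duplicate order
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    order = {"order_id": len(_processed) + 1, "items": payload["items"]}
    _processed[idempotency_key] = order
    return order
```

The client generates the key (typically a UUID) and sends it in an Idempotency-Key header; a retried request with the same key returns the original order instead of creating a second one.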

Status Codes:

200 OK              - Successful read or update
201 Created         - Resource created (include Location header)
204 No Content      - Successful delete
400 Bad Request     - Client sent invalid data
401 Unauthorized    - Missing or invalid authentication
403 Forbidden       - Authenticated but not authorized
404 Not Found       - Resource does not exist
409 Conflict        - Resource state conflict (duplicate, version mismatch)
429 Too Many Requests - Rate limit exceeded (include Retry-After header)
500 Internal Server Error - Server-side failure
503 Service Unavailable - Temporarily overloaded (include Retry-After)

Pagination:

python
# Cursor-based pagination (preferred for large datasets)
# GET /api/v1/orders?cursor=eyJpZCI6MTAwfQ&limit=20
import base64
import json

@app.get("/api/v1/orders")
def list_orders(cursor: str = None, limit: int = 20):
    if cursor:
        decoded = base64.b64decode(cursor)
        last_id = json.loads(decoded)["id"]
        # Cursor pagination requires a stable sort order on the cursor column
        orders = (db.query(Order)
                    .filter(Order.id > last_id)
                    .order_by(Order.id)
                    .limit(limit + 1)
                    .all())
    else:
        orders = db.query(Order).order_by(Order.id).limit(limit + 1).all()

    has_next = len(orders) > limit
    orders = orders[:limit]

    next_cursor = None
    if has_next:
        next_cursor = base64.b64encode(
            json.dumps({"id": orders[-1].id}).encode()
        ).decode()

    return {
        "data": orders,
        "pagination": {
            "next_cursor": next_cursor,
            "has_next": has_next
        }
    }

Cursor-based pagination avoids the offset problem where OFFSET 1000000 forces the database to scan and skip a million rows. As we discussed in Part 4, query performance degrades with large offsets.

GraphQL: When REST Falls Short

GraphQL solves specific problems that REST handles poorly:

  • Over-fetching: A mobile client needs just name and avatar, but the REST endpoint returns 30 fields.
  • Under-fetching: Displaying a user profile requires 3 separate REST calls (user, posts, followers).
  • Rapid iteration: Frontend teams can change their queries without waiting for backend API changes.

graphql
# Single GraphQL query replaces multiple REST calls
query UserProfile($id: ID!) {
  user(id: $id) {
    name
    avatar
    posts(last: 5) {
      title
      createdAt
    }
    followers {
      totalCount
    }
  }
}

GraphQL trade-offs:

  • Caching is harder (no URL-based HTTP caching, need persisted queries or APQ)
  • N+1 query problems require DataLoader pattern
  • Rate limiting is complex (one query can trigger thousands of database operations)
  • File uploads require workarounds
  • Learning curve for teams familiar with REST
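The N+1 problem above is usually solved with the DataLoader pattern: collect the keys requested during one resolution pass, then fetch them all in a single batched query. A minimal synchronous sketch with illustrative names (real GraphQL servers use an async implementation such as the aiodataloader package):

```python
# Minimal synchronous sketch of the DataLoader pattern (hypothetical class):
# queue keys during a resolution pass, then resolve with ONE batched fetch.
class DataLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn      # takes a list of keys, returns a list of values
        self.queue = []
        self.cache = {}

    def load(self, key):
        self.queue.append(key)
        return lambda: self.cache[key]   # deferred read, valid after dispatch()

    def dispatch(self):
        keys = list(dict.fromkeys(self.queue))   # dedupe, preserve order
        self.cache.update(dict(zip(keys, self.batch_fn(keys))))
        self.queue.clear()

# Without a loader, resolving three post authors means three queries; with it, one:
queries = []
def fetch_users(ids):
    queries.append(ids)                          # record each batched query
    return [{"id": i, "name": f"user-{i}"} for i in ids]

loader = DataLoader(fetch_users)
authors = [loader.load(1), loader.load(2), loader.load(1)]
loader.dispatch()
results = [get() for get in authors]             # one query served all three
```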

Use GraphQL when you have multiple client types (web, mobile, third-party) with different data needs. Stick with REST for simple CRUD APIs, public APIs, or when HTTP caching is critical.

API Gateway Pattern

An API gateway sits between clients and your backend services, handling cross-cutting concerns:

Client --> API Gateway --> Service A
                      --> Service B
                      --> Service C

Gateway responsibilities:
- Authentication and authorization
- Rate limiting
- Request routing
- Response aggregation
- SSL termination
- Request/response transformation
- Logging and monitoring

This relates directly to the load balancing concepts from Part 3. The API gateway often sits behind a load balancer and routes requests to the appropriate microservice.

yaml
# Kong API Gateway configuration example
services:
  - name: user-service
    url: http://user-service:8080
    routes:
      - name: user-routes
        paths:
          - /api/v1/users
    plugins:
      - name: rate-limiting
        config:
          minute: 100
          policy: redis
          redis_host: redis
      - name: jwt
        config:
          secret_is_base64: false
      - name: cors
        config:
          origins: ["https://myapp.com"]
          methods: ["GET", "POST", "PUT", "DELETE"]

Rate Limiting Algorithms

Rate limiting protects your system from abuse, ensures fair usage, and prevents cascading failures. There are four main algorithms:

1. Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window.

python
import redis
import time
 
redis_client = redis.Redis()
 
def fixed_window_rate_limit(client_id: str, limit: int, window_seconds: int) -> bool:
    window_key = f"ratelimit:{client_id}:{int(time.time()) // window_seconds}"
    current = redis_client.incr(window_key)
    if current == 1:
        redis_client.expire(window_key, window_seconds)
    return current <= limit

Problem: Burst at window boundaries. A client can send 100 requests at 11:59:59 and 100 more at 12:00:00, effectively getting 200 requests in 2 seconds.
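A quick way to see the boundary problem: two timestamps two seconds apart map to different window keys, so each burst gets a fresh counter. The timestamp below is arbitrary, chosen to sit on a minute boundary:

```python
window_seconds = 60
boundary = 1_700_000_040          # an arbitrary timestamp on a minute boundary
t1 = boundary - 1                 # 100 requests sent one second before...
t2 = boundary + 1                 # ...and 100 more one second after

w1 = t1 // window_seconds
w2 = t2 // window_seconds
# Different window keys -> both bursts pass, 200 requests in 2 seconds
print(w1 != w2)  # True
```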

2. Sliding Window Log

Store timestamps of all requests. Count requests within the sliding window.

python
import uuid

def sliding_window_log(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    key = f"ratelimit:log:{client_id}"

    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)   # Remove old entries
    # Unique member per request -- two requests with the same timestamp
    # would otherwise collapse into a single sorted-set entry
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    pipe.zcard(key)                                       # Count entries
    pipe.expire(key, window_seconds)
    results = pipe.execute()

    return results[2] <= limit

Precise but memory-intensive. Every request timestamp is stored, which is impractical at high volumes.

3. Sliding Window Counter

Combines fixed window efficiency with sliding window accuracy by weighting the previous window:

python
def sliding_window_counter(client_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
 
    # How far we are into the current window (0.0 to 1.0)
    elapsed_ratio = (now % window_seconds) / window_seconds
 
    current_key = f"ratelimit:{client_id}:{current_window}"
    previous_key = f"ratelimit:{client_id}:{previous_window}"
 
    current_count = int(redis_client.get(current_key) or 0)
    previous_count = int(redis_client.get(previous_key) or 0)
 
    # Weighted count: full current + proportional previous
    weighted_count = current_count + previous_count * (1 - elapsed_ratio)
 
    if weighted_count >= limit:
        return False
 
    redis_client.incr(current_key)
    redis_client.expire(current_key, window_seconds * 2)
    return True

This is a practical choice for most production systems -- low memory, good accuracy.

4. Token Bucket

A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens are added at a fixed rate. If the bucket is empty, the request is rejected.

python
def token_bucket(client_id: str, capacity: int, refill_rate: float) -> bool:
    """
    capacity: max burst size
    refill_rate: tokens added per second
    """
    key = f"ratelimit:bucket:{client_id}"
    now = time.time()

    bucket = redis_client.hmget(key, "tokens", "last_refill")
    tokens = float(bucket[0]) if bucket[0] else capacity
    last_refill = float(bucket[1]) if bucket[1] else now

    # Add tokens based on elapsed time
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * refill_rate)

    if tokens < 1:
        return False

    tokens -= 1
    # hset with mapping (hmset is deprecated in redis-py)
    redis_client.hset(key, mapping={"tokens": tokens, "last_refill": now})
    redis_client.expire(key, int(capacity / refill_rate) + 1)
    return True

Token bucket is the most widely used algorithm in production because it naturally handles burst traffic while enforcing a long-term average rate. AWS API Gateway, Stripe, and GitHub all use variations of token bucket.

OAuth2 Flows

OAuth2 is the standard for API authentication and authorization. The flow you choose depends on the client type:

Authorization Code Flow (web apps with a backend):

1. Client redirects user to authorization server
2. User authenticates and consents
3. Authorization server redirects back with an authorization code
4. Client exchanges code for access token (server-to-server, code is short-lived)
5. Client uses access token to call APIs

Authorization Code with PKCE (mobile/SPA): Same as above, but the client generates a code verifier and challenge to prevent authorization code interception. This is the recommended flow for all public clients.
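Generating the PKCE verifier and S256 challenge is straightforward with the standard library (per RFC 7636; the function name here is illustrative):

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    # code_verifier: 43-128 chars of URL-safe randomness (RFC 7636)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    # code_challenge: base64url(SHA-256(verifier)), padding stripped
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge
```

The client sends the challenge (with code_challenge_method=S256) on the authorization request and the verifier on the token exchange; the server recomputes the hash to confirm both came from the same client.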

Client Credentials (service-to-service):

python
# Service-to-service authentication
import requests
 
response = requests.post("https://auth.example.com/oauth/token", data={
    "grant_type": "client_credentials",
    "client_id": "service-orders",
    "client_secret": "secret",
    "scope": "read:users"
})
 
access_token = response.json()["access_token"]
 
# Use the token
headers = {"Authorization": f"Bearer {access_token}"}
users = requests.get("https://api.example.com/users", headers=headers)

API Key Management

API keys are simpler than OAuth2 but less secure. They are appropriate for server-to-server calls where the key can be kept secret.

Best practices:

  • Generate keys with sufficient entropy (at least 32 random bytes, base64-encoded)
  • Hash keys before storing them (treat them like passwords)
  • Support key rotation with overlapping validity periods
  • Scope keys to specific permissions and rate limits
  • Include a prefix for easy identification (sk_live_, pk_test_)
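The generation and hashing steps above can be sketched with only the standard library (the sk_live_ prefix style follows Stripe's convention; function names are illustrative):

```python
import hashlib
import secrets

def generate_api_key(prefix: str = "sk_live_") -> tuple[str, str]:
    # 32 random bytes, URL-safe encoded -- shown to the user exactly once
    key = prefix + secrets.token_urlsafe(32)
    # Store only the hash, like a password; the plaintext key is never persisted
    key_hash = hashlib.sha256(key.encode()).hexdigest()
    return key, key_hash

def verify_api_key(presented: str, stored_hash: str) -> bool:
    candidate = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids timing side channels
    return secrets.compare_digest(candidate, stored_hash)
```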

API Versioning Strategies

URL path versioning: /api/v1/users -- Simple, explicit, easy to route. Most common approach.

Header versioning: Accept: application/vnd.myapi.v2+json -- Cleaner URLs but harder to test in a browser.

Query parameter: /api/users?version=2 -- Easy to implement but pollutes the URL.

No versioning (evolve in place): Add fields, never remove or rename. Use feature flags. Works for internal APIs.

The pragmatic choice for most teams is URL path versioning. It is explicit, cacheable, and easy to route at the API gateway level.

Practical Implementation

Here is a complete rate-limited API endpoint using FastAPI:

python
from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt
import redis
import time
 
app = FastAPI()
security = HTTPBearer()
redis_client = redis.Redis(host='localhost', port=6379)
 
SECRET_KEY = "your-secret-key"  # load from configuration/secrets manager in production
 
def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    try:
        payload = jwt.decode(credentials.credentials, SECRET_KEY, algorithms=["HS256"])
        return payload
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")
 
def rate_limit(request: Request, user=Depends(verify_token)):
    client_id = user.get("sub", request.client.host)
    limit = user.get("rate_limit", 100)  # Per-user configurable limits
 
    if not sliding_window_counter(client_id, limit, window_seconds=60):
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
            headers={"Retry-After": "60"}
        )
    return user
 
@app.get("/api/v1/orders")
def list_orders(user=Depends(rate_limit)):
    # Authenticated, rate-limited endpoint
    return {"orders": get_user_orders(user["sub"])}

Trade-offs and Decision Framework

Concern         Simple Approach       Production Approach
-------         ---------------       -------------------
Auth            API keys              OAuth2 with PKCE + JWTs
Rate Limiting   Fixed window per IP   Token bucket per user + per IP with Redis
Versioning      URL path              URL path + sunset headers
Pagination      Offset-based          Cursor-based
Error Format    Status code only      RFC 7807 Problem Details
Documentation   Manual                OpenAPI/Swagger auto-generated

Common Interview Questions

Q: How would you design rate limiting for a global API serving millions of requests per second?

A: Use a token bucket algorithm backed by Redis. Deploy Redis clusters in each region for low latency. Use a local in-memory counter as a first pass (to reduce Redis calls) with periodic synchronization. Set different limits per tier (free, pro, enterprise). Return 429 with a Retry-After header. For distributed rate limiting across regions, accept slight over-limit (eventually consistent counters) or use a centralized counter with higher latency.
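The "local first pass" idea can be sketched as a per-node negative cache: once the shared Redis limiter reports a client over limit, reject that client locally for a short period without further Redis round trips. The helper names here are hypothetical:

```python
import time

# Per-node negative cache in front of a shared Redis limiter (illustrative).
_blocked_until: dict[str, float] = {}

def allow_request(client_id: str, redis_check, backoff_seconds: float = 1.0) -> bool:
    now = time.time()
    if now < _blocked_until.get(client_id, 0.0):
        return False                  # known to be over limit; skip Redis entirely
    if redis_check(client_id):        # authoritative shared counter
        return True
    # Over limit: back off locally so retries don't hammer Redis
    _blocked_until[client_id] = now + backoff_seconds
    return False
```

The trade-off is brief staleness: a client may be rejected locally for up to backoff_seconds after its shared budget refills, which is usually acceptable for abuse protection.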

Q: A client needs data from 5 microservices to render one page. How do you optimize this?

A: Use an API gateway to aggregate responses server-side (Backend for Frontend pattern). Alternatively, use GraphQL to let the client specify exactly what it needs in one request. For frequently accessed combinations, create a dedicated composite endpoint. Ensure downstream services are cached as described in Part 5.

Q: How do you handle API versioning when you need to make a breaking change?

A: Deploy the new version alongside the old one. Communicate the deprecation timeline (minimum 6-12 months for public APIs). Add Sunset and Deprecation headers to old-version responses. Monitor usage of the old version. Provide a migration guide. Only decommission after confirming zero or negligible traffic.
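The Sunset header is defined in RFC 8594 and uses the HTTP-date format; Deprecation is a companion header signaling that the version is already deprecated. A sketch of building them (the successor URL and function name are illustrative):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def deprecation_headers(sunset_at: datetime) -> dict:
    """Headers to attach to every response from a deprecated API version."""
    return {
        "Deprecation": "true",
        # Sunset (RFC 8594) takes an HTTP-date; normalize to UTC/GMT
        "Sunset": format_datetime(sunset_at.astimezone(timezone.utc), usegmt=True),
        "Link": '</api/v2/orders>; rel="successor-version"',  # illustrative URL
    }
```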

Q: How do you prevent API abuse beyond rate limiting?

A: Layer multiple defenses: rate limiting per user and per IP, request size limits, input validation, authentication on all endpoints, CORS configuration, WAF rules for common attack patterns, anomaly detection on usage patterns, and IP reputation scoring.

What's Next

With APIs secured and rate-limited, Part 9: CAP Theorem and Distributed Consensus dives into the fundamental trade-offs that govern how distributed systems maintain consistency when things go wrong.

FAQ

What are the most common rate limiting algorithms?

The four main algorithms are fixed window counter (simple), sliding window log (precise), sliding window counter (efficient and accurate), and token bucket (handles bursty traffic). Token bucket is the most widely used in production.

Should I use REST or GraphQL for my API?

Use REST for simple CRUD operations with well-defined resources. Use GraphQL when clients need flexible queries, you want to avoid over-fetching, or your frontend teams need to iterate without backend changes.

How do I version my API without breaking existing clients?

Use URL path versioning (/v1/users) for simplicity or header-based versioning for cleaner URLs. Always maintain backward compatibility within a version and give clients a deprecation timeline for older versions.


Article Author

Sadam Hussain, Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
