Design an E-Commerce Order Processing System
Design a fault-tolerant e-commerce order system with inventory management, payment processing, saga pattern for transactions, and event-driven order fulfillment.
This post applies concepts from the System Design from Zero to Hero series.
TL;DR
An e-commerce order system uses the saga pattern to coordinate inventory, payment, and fulfillment services, with compensating transactions to handle failures gracefully. The order follows a state machine from cart through payment to delivery, with each transition triggering events that downstream services consume. Inventory reservation uses optimistic locking to prevent overselling, and idempotency keys ensure payment retries do not charge customers twice. The entire system is event-driven, with each service owning its own data and communicating through a message broker.
Requirements
Functional Requirements
- ›Cart management — Users can add/remove items, update quantities, and view their cart.
- ›Order placement — Convert a cart into an order with address, payment method, and delivery preference.
- ›Payment processing — Charge the customer and handle payment failures, refunds, and retries.
- ›Inventory management — Track stock levels, reserve inventory during checkout, and prevent overselling.
- ›Order tracking — Display real-time order status (confirmed, processing, shipped, delivered).
- ›Order cancellation — Allow cancellation before shipment, triggering refund and inventory release.
Non-Functional Requirements
- ›Consistency — Inventory must never go negative. A customer must never be charged without an order being created.
- ›Idempotency — Payment retries must not result in double charges.
- ›Availability — The checkout flow must be available 99.99% of the time during peak traffic.
- ›Scalability — Handle 10,000 orders per second during flash sales.
- ›Fault tolerance — If any service fails mid-transaction, the system must recover to a consistent state.
Back-of-Envelope Estimation
Assume a large e-commerce platform during a flash sale:
- ›Peak order rate: 10,000 orders/second
- ›Average items per order: 2.5 items → 25,000 inventory operations/second
- ›Payment processing: 10,000 payment requests/second (each taking 1-3 seconds with payment provider)
- ›Cart reads: 100,000/second (users browsing, much higher than order rate)
- ›Order storage: ~2 KB per order → 10,000 × 2 KB = 20 MB/second → ~1.7 TB/day if the peak rate were sustained for a full day (an upper bound)
- ›Inventory updates: Must serialize per-SKU to prevent overselling → partition by product_id
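The arithmetic behind these estimates can be checked with a few lines of Python. This is a rough sketch: the constant names are illustrative, decimal units (1 MB = 1,000 KB) are assumed, and the in-flight payment figure is an added inference from Little's law (concurrency = arrival rate × latency), not a number from the list above.

```python
# Back-of-envelope arithmetic for the flash-sale estimates.
PEAK_ORDERS_PER_SEC = 10_000
AVG_ITEMS_PER_ORDER = 2.5
ORDER_SIZE_KB = 2
SECONDS_PER_DAY = 86_400

inventory_ops_per_sec = PEAK_ORDERS_PER_SEC * AVG_ITEMS_PER_ORDER
write_mb_per_sec = PEAK_ORDERS_PER_SEC * ORDER_SIZE_KB / 1_000  # decimal MB

# Upper bound: assumes the peak rate is sustained for a full 24 hours.
storage_tb_per_day = write_mb_per_sec * SECONDS_PER_DAY / 1_000_000

# Little's law: payments in flight = arrival rate x provider latency.
in_flight_payments_low = PEAK_ORDERS_PER_SEC * 1   # 1 s provider latency
in_flight_payments_high = PEAK_ORDERS_PER_SEC * 3  # 3 s provider latency

print(f"{inventory_ops_per_sec:,.0f} inventory ops/s")
print(f"{write_mb_per_sec:,.0f} MB/s of order writes")
print(f"{storage_tb_per_day:.2f} TB/day at sustained peak")
print(f"{in_flight_payments_low:,} to {in_flight_payments_high:,} payments in flight")
```

The in-flight figure matters for sizing: the Payment Service needs enough concurrent connections or workers to hold tens of thousands of outstanding provider calls at peak.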
High-Level Design
Client → API Gateway → Cart Service
              ↓
        Order Service → Event Bus (Kafka)
                         ↙     ↓     ↘
                 Payment  Inventory  Fulfillment
                 Service   Service     Service
                    ↓         ↓           ↓
                 Payment  Inventory   Shipping
                 Provider  Database   Provider
Order placement flow:
- ›User clicks "Place Order." The API gateway calls the Order Service.
- ›The Order Service creates an order in CREATED state and publishes an OrderCreated event.
- ›The Inventory Service consumes the event, reserves stock, and publishes InventoryReserved.
- ›The Payment Service consumes InventoryReserved, charges the customer, and publishes PaymentCompleted.
- ›The Order Service consumes both events, moving the order to RESERVED and then to CONFIRMED.
- ›The Fulfillment Service picks up the confirmed order for shipping.
If any step fails, compensating transactions undo the previous steps (release inventory, refund payment).
Detailed Design
Order State Machine
The order follows a well-defined state machine. Every transition is triggered by an event, and only valid transitions are allowed.
┌──────────────┐
│   CREATED    │
└──────┬───────┘
       │ inventory reserved
┌──────▼───────┐
│   RESERVED   │──── inventory failed ──→ CANCELLED
└──────┬───────┘
       │ payment completed
┌──────▼───────┐
│  CONFIRMED   │──── payment failed ──→ RESERVED (retry)
└──────┬───────┘                               │
       │ shipped                       release inventory
┌──────▼───────┐                               │
│   SHIPPED    │                           CANCELLED
└──────┬───────┘
       │ delivered
┌──────▼───────┐
│  DELIVERED   │
└──────────────┘
Each state transition is persisted as an event in an append-only order events table. This gives a full audit trail and enables event sourcing if needed.
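from datetime import datetime, timezone
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    RESERVED = "reserved"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"
    REFUND_PENDING = "refund_pending"
    REFUNDED = "refunded"


class InvalidTransitionError(Exception):
    """Raised when an order is asked to make a transition that is not allowed."""


# Terminal states (CANCELLED, REFUNDED) have no outgoing transitions.
VALID_TRANSITIONS = {
    OrderState.CREATED: [OrderState.RESERVED, OrderState.CANCELLED],
    OrderState.RESERVED: [OrderState.CONFIRMED, OrderState.CANCELLED],
    OrderState.CONFIRMED: [OrderState.SHIPPED, OrderState.CANCELLED],
    OrderState.SHIPPED: [OrderState.DELIVERED],
    OrderState.DELIVERED: [OrderState.REFUND_PENDING],
    OrderState.REFUND_PENDING: [OrderState.REFUNDED],
}


def transition_order(order, new_state, event_data):
    if new_state not in VALID_TRANSITIONS.get(order.state, []):
        raise InvalidTransitionError(
            f"Cannot transition from {order.state} to {new_state}"
        )
    # Capture the old state before overwriting it, so the audit row
    # records the real from_state rather than the new one.
    previous_state = order.state
    now = datetime.now(timezone.utc)
    order.state = new_state
    order.updated_at = now
    # order_events and OrderEvent stand for the append-only order events
    # table and its row type.
    order_events.append(OrderEvent(
        order_id=order.id,
        from_state=previous_state,
        to_state=new_state,
        event_data=event_data,
        timestamp=now,
    ))

Saga Pattern for Distributed Transactions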
An order involves multiple services: Inventory, Payment, and Fulfillment. A traditional database transaction cannot span these services. The saga pattern breaks the process into a sequence of local transactions, each with a compensating action that undoes it on failure. For event-driven architecture patterns, see Part 6: Message Queues and Event-Driven Architecture.
Choreography-based saga (event-driven):
Each service listens for events and reacts independently. No central coordinator.
1. Order Service     → publishes OrderCreated
2. Inventory Service → consumes OrderCreated
                     → reserves inventory
                     → publishes InventoryReserved (or InventoryFailed)
3. Payment Service   → consumes InventoryReserved
                     → charges payment
                     → publishes PaymentCompleted (or PaymentFailed)
4. Order Service     → consumes PaymentCompleted
                     → updates order to CONFIRMED
                     → publishes OrderConfirmed
5. Fulfillment       → consumes OrderConfirmed
                     → initiates shipping
Compensating transactions (on failure):
PaymentFailed event:
  → Inventory Service releases reserved stock
  → Order Service marks order as CANCELLED
  → Notification Service sends failure email to user
InventoryFailed event:
  → Order Service marks order as CANCELLED
  → Notification Service sends "out of stock" email
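The choreography and its compensations can be sketched with an in-memory event bus. This is a minimal illustration, not a production design: EventBus is a synchronous stand-in for Kafka, the card_declined flag is a hypothetical way to force a payment failure, and the Notification Service is left out.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-memory stand-in for a broker such as Kafka."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)


bus = EventBus()
stock = {"SKU-123": 1}   # Inventory Service state
reserved = {}            # order_id -> (sku, qty)
order_status = {}        # Order Service state


def on_order_created(order):
    sku, qty = order["sku"], order["qty"]
    if stock.get(sku, 0) >= qty:
        stock[sku] -= qty
        reserved[order["id"]] = (sku, qty)
        bus.publish("InventoryReserved", order)
    else:
        bus.publish("InventoryFailed", order)


def on_inventory_reserved(order):
    # Hypothetical rule: the charge fails when the order is flagged.
    declined = order.get("card_declined", False)
    bus.publish("PaymentFailed" if declined else "PaymentCompleted", order)


def release_stock(order):
    # Compensating transaction for the earlier reservation.
    sku, qty = reserved.pop(order["id"])
    stock[sku] += qty


def mark(status):
    def handler(order):
        order_status[order["id"]] = status
    return handler


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("InventoryReserved", on_inventory_reserved)
bus.subscribe("PaymentCompleted", mark("CONFIRMED"))
bus.subscribe("PaymentFailed", release_stock)
bus.subscribe("PaymentFailed", mark("CANCELLED"))
bus.subscribe("InventoryFailed", mark("CANCELLED"))


def place_order(order):
    order_status[order["id"]] = "CREATED"
    bus.publish("OrderCreated", order)
    return order_status[order["id"]]
```

Placing an order with card_declined set ends in CANCELLED with the stock restored, which is the PaymentFailed path listed above. No service calls another directly; each one only reacts to events.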
Orchestration-based saga (alternative):
A central Order Saga Orchestrator coordinates the steps explicitly:
class OrderSagaOrchestrator:
    def execute(self, order):
        # Steps that have succeeded so far and would need compensation.
        completed_steps = []
        try:
            # Step 1: Reserve inventory
            inventory_result = inventory_service.reserve(
                order.items, order.id
            )
            if not inventory_result.success:
                self.cancel_order(order, "Inventory unavailable")
                return
            completed_steps.append("inventory_reserved")
            # The state machine only allows CONFIRMED from RESERVED.
            order.transition(OrderState.RESERVED)
            # Step 2: Process payment
            payment_result = payment_service.charge(
                order.user_id, order.total,
                idempotency_key=f"order-{order.id}"
            )
            if not payment_result.success:
                # Compensate: release inventory
                inventory_service.release(order.items, order.id)
                self.cancel_order(order, "Payment failed")
                return
            completed_steps.append("payment_charged")
            # Step 3: Confirm order
            order.transition(OrderState.CONFIRMED)
            event_bus.publish("OrderConfirmed", order)
        except Exception:
            # Compensate all completed steps
            self.compensate(order, completed_steps)

Choreography vs Orchestration:
- ›Choreography is simpler for small sagas (3-4 steps) but becomes hard to trace and debug as complexity grows.
- ›Orchestration centralizes the flow logic, making it easier to understand, test, and monitor. Preferred for complex multi-step flows.
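One way to keep an orchestrator's compensation logic explicit is to pair every step with its undo action and unwind in reverse order on failure. The sketch below is a generic illustration; SagaStep and run_saga are hypothetical names, not part of any particular saga framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensation: Callable[[], None]


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


# Usage: the payment step fails, so only the reservation is undone.
trace: List[str] = []


def failing_charge() -> None:
    raise RuntimeError("card declined")


succeeded = run_saga([
    SagaStep("reserve_inventory",
             lambda: trace.append("reserve"),
             lambda: trace.append("release")),
    SagaStep("charge_payment",
             failing_charge,
             lambda: trace.append("refund")),
    SagaStep("confirm_order",
             lambda: trace.append("confirm"),
             lambda: trace.append("cancel")),
])
```

Because the failed step never completed, its own compensation (the refund) is not run. In a real orchestrator the list of completed steps would be persisted, so that compensation can resume after the orchestrator itself crashes.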
Inventory Reservation: Pessimistic vs Optimistic Locking
The core inventory challenge is preventing overselling: two users trying to buy the last item must not both succeed. For database-level locking strategies, see Part 7: Sharding and Partitioning.
Pessimistic locking (SELECT FOR UPDATE):
Lock the inventory row when reading it. Other transactions must wait until the lock is released.
BEGIN;
SELECT available FROM inventory WHERE product_id = 'SKU123' FOR UPDATE;
-- If available >= requested_amount:
UPDATE inventory SET available = available - 1,
    reserved = reserved + 1
WHERE product_id = 'SKU123';
COMMIT;
Pros: Guarantees no overselling.
Cons: Lock contention. During a flash sale, thousands of concurrent requests for the same product serialize on one row lock, creating a bottleneck. Throughput for that SKU is bounded by how quickly the database can grant and release the lock.
Optimistic locking (version-based):
Read the inventory row with its version number. On update, check that the version has not changed. If it has, retry.
-- Read
SELECT available, version FROM inventory WHERE product_id = 'SKU123';
-- available = 10, version = 42
-- Update with version check
UPDATE inventory
SET available = available - 1, reserved = reserved + 1, version = version + 1
WHERE product_id = 'SKU123' AND version = 42;
-- If rows_affected = 0, version changed → retry
Pros: No locks held between the read and the write. Higher throughput under moderate contention.
Cons: Under high contention (flash sales), most update attempts lose the version race and must retry, wasting resources. Works well when conflicts are rare.
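The read, conditional update, and retry loop can be sketched end to end with Python's built-in sqlite3 module. This is an illustration under assumptions: it uses the available, reserved, and version columns from the Inventory table in the Data Model section, an in-memory database, and hypothetical helper names; a production system would run the same statements against its own database.

```python
import sqlite3


def make_inventory_db():
    """Create an in-memory inventory table with one unit of SKU123."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (product_id TEXT PRIMARY KEY, "
        "available INTEGER, reserved INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES ('SKU123', 1, 0, 42)")
    conn.commit()
    return conn


def try_reserve(conn, product_id, expected_version):
    """Conditional update: succeeds only if the version is unchanged."""
    cur = conn.execute(
        "UPDATE inventory "
        "SET available = available - 1, reserved = reserved + 1, "
        "    version = version + 1 "
        "WHERE product_id = ? AND version = ? AND available >= 1",
        (product_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


def reserve_one(conn, product_id, max_retries=3):
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT available, version FROM inventory WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None or row[0] < 1:
            return False  # unknown SKU or sold out: retrying cannot help
        if try_reserve(conn, product_id, row[1]):
            return True
        # Another writer changed the row between read and update; re-read.
    return False
```

Calling try_reserve with a stale version (for example 41 when the row is at 42) updates zero rows, which is the signal to re-read and retry. Bounding the retries keeps a hot SKU from spinning indefinitely.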
Recommended approach for flash sales:
Route purchase requests through a message queue partitioned by product. All purchase requests for a given SKU are routed to the same queue partition and processed serially. This removes lock contention on the inventory row while maintaining strict per-SKU ordering.
Purchase requests for SKU-123 → Kafka partition (key=SKU-123) → Single consumer
The consumer processes one request at a time: check stock, decrement, confirm. Kafka assigns each partition to at most one consumer within a consumer group, so requests for a given SKU never run concurrently and no locking is needed.
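The routing idea can be sketched without a broker: hash the SKU to pick a partition, append the request to that partition, and let one loop drain it in order. The names and the partition count are assumptions, and md5 is used only as a stable stand-in hash; Kafka's own partitioner hashes keys differently.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # assumed partition count for the sketch

partitions = defaultdict(list)  # partition id -> ordered purchase requests


def partition_for(sku: str) -> int:
    """Map a SKU to a partition; the same SKU always maps to the same one."""
    digest = hashlib.md5(sku.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def enqueue(request: dict) -> None:
    partitions[partition_for(request["sku"])].append(request)


def drain(partition_id: int, stock: dict) -> list:
    """Single consumer: handle requests one at a time, in arrival order."""
    results = []
    for req in partitions[partition_id]:
        if stock.get(req["sku"], 0) >= req["qty"]:
            stock[req["sku"]] -= req["qty"]
            results.append((req["user"], "confirmed"))
        else:
            results.append((req["user"], "sold_out"))
    partitions[partition_id].clear()
    return results
```

With one unit in stock and two queued requests for the same SKU, the first is confirmed and the second is told the item is sold out, with no lock or version check involved.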
Reservation with TTL:
When a user begins checkout, the system does not yet treat the stock as sold. Instead, it moves the requested units from available to reserved under a time-limited reservation (e.g., 10 minutes). If the user does not complete checkout within the TTL, the reservation expires and the stock is released back. This prevents abandoned checkouts from permanently reducing available inventory.
from datetime import datetime, timedelta, timezone


def reserve_inventory(product_id: str, quantity: int, order_id: str):
    reservation = {
        "product_id": product_id,
        "quantity": quantity,
        "order_id": order_id,
        "expires_at": datetime.now(timezone.utc) + timedelta(minutes=10),
    }
    # Atomic: decrement available, increment reserved. The WHERE clause
    # rejects the update when there is not enough stock.
    result = db.execute("""
        UPDATE inventory
        SET available = available - %s, reserved = reserved + %s
        WHERE product_id = %s AND available >= %s
    """, (quantity, quantity, product_id, quantity))
    if result.rows_affected == 0:
        raise InsufficientStockError()
    # Schedule TTL expiration
    scheduler.schedule_at(reservation["expires_at"],
                          release_reservation, order_id)
    return reservation

Idempotency Keys for Payments
Payment processing must be idempotent. If the client retries a payment request (due to a timeout or network error), the payment provider must not charge the customer again. The solution is an idempotency key: a unique identifier sent with each payment request.
from decimal import Decimal


def process_payment(order_id: str, amount: Decimal, payment_method: str):
    idempotency_key = f"payment-{order_id}"
    # Check if this payment was already processed
    existing = db.query(
        "SELECT * FROM payments WHERE idempotency_key = %s",
        (idempotency_key,)
    )
    if existing and existing.status != "PENDING":
        return existing  # Return the previous result
    # Reuse a PENDING record left by an attempt whose outcome is unknown;
    # otherwise create the record before calling the provider. A UNIQUE
    # constraint on idempotency_key makes a concurrent duplicate insert fail.
    payment = existing or db.insert("payments", {
        "idempotency_key": idempotency_key,
        "order_id": order_id,
        "amount": amount,
        "status": "PENDING",
        "created_at": now()
    })
    try:
        # Call payment provider with idempotency key
        result = payment_provider.charge(
            amount=amount,
            payment_method=payment_method,
            idempotency_key=idempotency_key  # Provider also deduplicates
        )
        payment.update(status="COMPLETED", provider_id=result.id)
        return payment
    except PaymentDeclinedError:
        payment.update(status="FAILED")
        raise

Two layers of idempotency:
- ›Application layer: The Payment Service checks its own database before calling the payment provider.
- ›Provider layer: Stripe, PayPal, and other providers accept an idempotency_key parameter and deduplicate on their end.
Both layers are necessary because network failures can cause the application to miss the provider's response. The provider might have charged successfully, but the application never received the confirmation. On retry, the provider's idempotency check prevents a double charge.
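The lost-response scenario can be simulated with a fake provider. Everything here is hypothetical scaffolding: FakeProvider mimics a provider that deduplicates on the idempotency key, and drop_next_response forces a timeout after the charge has already been recorded.

```python
from decimal import Decimal


class FakeProvider:
    """Hypothetical provider stand-in that deduplicates on idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> charge record
        self.drop_next_response = False

    def charge(self, amount, idempotency_key):
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "id": f"ch_{len(self.charges) + 1}",
                "amount": amount,
            }
        if self.drop_next_response:
            # The charge was recorded, but the caller never hears about it.
            self.drop_next_response = False
            raise TimeoutError("response lost after the charge was recorded")
        return self.charges[idempotency_key]


def charge_with_retry(provider, order_id, amount, attempts=3):
    key = f"payment-{order_id}"  # the same key on every attempt
    for _ in range(attempts):
        try:
            return provider.charge(amount, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("payment outcome unknown after retries")


provider = FakeProvider()
provider.drop_next_response = True
result = charge_with_retry(provider, "ord_abc123", Decimal("59.98"))
```

The first attempt charges and then times out; the retry carries the same key, so the provider returns the original charge instead of creating a second one.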
Event-Driven Order Updates
Every state change in the order lifecycle is published as an event to Kafka. Downstream services consume events relevant to them:
{
  "event_type": "OrderConfirmed",
  "order_id": "ord_abc123",
  "user_id": "usr_456",
  "items": [
    {"product_id": "SKU-123", "quantity": 2, "price": 29.99}
  ],
  "total": 59.98,
  "timestamp": "2025-08-05T14:23:00Z"
}
Event consumers:
- ›Fulfillment Service: Listens for OrderConfirmed to initiate picking and packing.
- ›Notification Service: Listens for all order events to send status update emails/push.
- ›Analytics Service: Listens for OrderConfirmed and OrderCancelled for revenue tracking.
- ›Search/Recommendation Service: Listens for order events to update purchase history.
Event ordering guarantee: Use Kafka with order_id as the partition key. All events for a given order go to the same partition and are processed in order. This ensures a consumer never processes OrderShipped before OrderConfirmed.
Cart Service Design
The cart is a high-read, high-write, ephemeral data structure. It does not need the same durability guarantees as orders.
Storage options:
- ›Redis (recommended for logged-in users): Store the cart as a Redis hash. Fast reads and writes. Set a TTL of 30 days for cart expiration. If Redis goes down, the cart is lost, but that is acceptable for most e-commerce sites.
- ›Database (for persistence): Store cart data in a database for users who expect their cart to persist across devices and sessions. Use Redis as a write-through cache in front of the database.
- ›Client-side (for anonymous users): Store the cart in a cookie or local storage. Merge with the server-side cart upon login.
Redis hash key: cart:{user_id}
Fields:
  SKU-123: {"quantity": 2, "price": 29.99, "added_at": "..."}
  SKU-456: {"quantity": 1, "price": 49.99, "added_at": "..."}
TTL: 30 days
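Merging an anonymous cart into the server-side cart on login, and shaping the result as hash fields, might look like the sketch below. The merge policy (sum quantities, keep the server-side entry's other attributes) is an assumption, and the function names are illustrative. The output of to_hash_fields is the field-to-value mapping a Redis hash would store; no Redis client is called here.

```python
import json


def merge_carts(server_cart: dict, anonymous_cart: dict) -> dict:
    """Return a new cart; quantities are summed for SKUs present in both."""
    merged = {sku: dict(item) for sku, item in server_cart.items()}
    for sku, item in anonymous_cart.items():
        if sku in merged:
            merged[sku]["quantity"] += item["quantity"]
        else:
            merged[sku] = dict(item)
    return merged


def to_hash_fields(cart: dict) -> dict:
    """Serialize each cart line to JSON, one hash field per SKU."""
    return {sku: json.dumps(item, sort_keys=True) for sku, item in cart.items()}


server_cart = {"SKU-123": {"quantity": 2, "price": 29.99}}
anonymous_cart = {
    "SKU-123": {"quantity": 1, "price": 29.99},
    "SKU-456": {"quantity": 1, "price": 49.99},
}
merged_cart = merge_carts(server_cart, anonymous_cart)
hash_fields = to_hash_fields(merged_cart)
```

The inputs are copied rather than mutated, so a failed write to the cart store leaves both source carts intact for a retry.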
Cart-to-order conversion:
When the user clicks "Place Order," the Cart Service reads the cart, validates prices against the current catalog (prices may have changed since the item was added), validates inventory availability, and passes the validated cart to the Order Service. The Order Service creates the order and deletes the cart.
Price consistency: The price shown in the cart may differ from the current price at checkout time. The system should re-validate prices at order placement and display any changes to the user before confirming.
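A price re-validation step at order placement could be as small as the function below. It is a sketch: the cart and catalog shapes are assumptions, and Decimal is used instead of float so that price comparisons are exact.

```python
from decimal import Decimal


def revalidate_prices(cart: dict, catalog: dict) -> list:
    """Return one entry per cart line whose price no longer matches the catalog."""
    changes = []
    for sku, item in cart.items():
        current = catalog[sku]
        if current != item["price"]:
            changes.append({"sku": sku, "old": item["price"], "new": current})
    return changes


cart = {
    "SKU-123": {"quantity": 2, "price": Decimal("29.99")},
    "SKU-456": {"quantity": 1, "price": Decimal("49.99")},
}
catalog = {"SKU-123": Decimal("29.99"), "SKU-456": Decimal("44.99")}
price_changes = revalidate_prices(cart, catalog)
```

An empty result means the order can proceed at the cart prices; a non-empty result is what the checkout page would show the user before asking for confirmation.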
Data Model
Orders Table
| Column | Type | Description |
|---|---|---|
| order_id | UUID | Primary key |
| user_id | UUID | Customer who placed the order |
| status | VARCHAR | created/reserved/confirmed/shipped/delivered/cancelled/refund_pending/refunded |
| total_amount | DECIMAL | Order total |
| currency | VARCHAR(3) | Currency code (USD, EUR) |
| shipping_address | JSONB | Delivery address |
| payment_method | VARCHAR | Payment method identifier |
| created_at | TIMESTAMP | Order creation time |
| updated_at | TIMESTAMP | Last status change |
Order Items Table
| Column | Type | Description |
|---|---|---|
| order_id | UUID | Foreign key to orders |
| product_id | VARCHAR | SKU identifier |
| quantity | INT | Number of units |
| unit_price | DECIMAL | Price at time of purchase |
| subtotal | DECIMAL | quantity * unit_price |
Order Events Table (Append-Only Audit Log)
| Column | Type | Description |
|---|---|---|
| event_id | UUID | Primary key |
| order_id | UUID | Foreign key to orders |
| event_type | VARCHAR | OrderCreated, PaymentCompleted, etc. |
| from_state | VARCHAR | Previous state |
| to_state | VARCHAR | New state |
| event_data | JSONB | Additional event metadata |
| created_at | TIMESTAMP | Event timestamp |
Inventory Table
| Column | Type | Description |
|---|---|---|
| product_id | VARCHAR | Primary key (SKU) |
| available | INT | Available for purchase |
| reserved | INT | Reserved by pending orders |
| warehouse_id | VARCHAR | Physical location |
| version | INT | Optimistic lock version |
| updated_at | TIMESTAMP | Last update time |
Payments Table
| Column | Type | Description |
|---|---|---|
| payment_id | UUID | Primary key |
| order_id | UUID | Foreign key to orders |
| idempotency_key | VARCHAR | Unique key for deduplication |
| amount | DECIMAL | Charged amount |
| currency | VARCHAR(3) | Currency code |
| status | VARCHAR | pending/completed/failed/refunded |
| provider_id | VARCHAR | External provider transaction ID |
| created_at | TIMESTAMP | Payment initiation time |
| completed_at | TIMESTAMP | Payment completion time |
Scaling Considerations
Order Service scaling: The Order Service is stateless and scales horizontally. Orders are partitioned by order_id in the database. For sharding strategies, see Part 7: Sharding and Partitioning.
Inventory Service — the hot spot problem: During flash sales, a single popular product receives thousands of concurrent purchase requests. Solutions:
- ›Queue-based serialization: Route all requests for a product to a single Kafka partition. Process them sequentially.
- ›Inventory sharding: Split inventory for a hot product across multiple "virtual inventory pools." Each pool handles a fraction of the total stock. Requests are distributed across pools.
- ›Pre-deduct with reconciliation: Deduct inventory optimistically and reconcile asynchronously. Risk of slight overselling, handled by backorders.
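The virtual inventory pool idea from the list above can be sketched in memory. In a real deployment each pool would be its own row or partition so that writers contend on different records; here the pools are just list entries, and the class name and fallback policy (try the next pool when the chosen one is empty) are assumptions.

```python
import random


class PooledInventory:
    """Split one SKU's stock across several pools to spread write contention."""

    def __init__(self, total_stock: int, num_pools: int, rng=None):
        base, extra = divmod(total_stock, num_pools)
        # The first `extra` pools take one additional unit each.
        self.pools = [base + (1 if i < extra else 0) for i in range(num_pools)]
        self.rng = rng or random.Random()

    def reserve(self, quantity: int = 1):
        """Return the index of the pool that served the request, or None."""
        start = self.rng.randrange(len(self.pools))
        for offset in range(len(self.pools)):
            i = (start + offset) % len(self.pools)
            if self.pools[i] >= quantity:
                self.pools[i] -= quantity
                return i
        return None

    def remaining(self) -> int:
        return sum(self.pools)
```

One limitation of this sketch: a request for more units than any single pool holds is rejected even when the total across pools would cover it, so pool sizes need to be chosen with the maximum order quantity in mind.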
Payment Service: Payment provider APIs are the bottleneck (typically 1-3 seconds per call). Use connection pooling, circuit breakers (to avoid hammering a failing provider), and multiple payment providers for failover.
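A circuit breaker in front of the provider client can be sketched in a few dozen lines. This is a simplified illustration with assumed names and thresholds: after a run of consecutive failures it rejects calls immediately for a cool-down period, then lets one trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock so the cool-down can be stepped through.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: fake_now[0])


def failing_provider():
    raise ConnectionError("provider down")


for _ in range(2):
    try:
        breaker.call(failing_provider)
    except ConnectionError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected_while_open = False
except CircuitOpenError:
    rejected_while_open = True

fake_now[0] = 11.0  # cool-down has elapsed
recovered = breaker.call(lambda: "ok")
```

Catching CircuitOpenError is a natural point to fail over to a secondary payment provider rather than returning an error to the customer.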
Event bus (Kafka) partitioning: Partition the order events topic by order_id to guarantee per-order event ordering. The inventory events topic should be partitioned by product_id for per-product ordering.
Database choices: Use PostgreSQL for orders and payments (ACID transactions needed). Use a separate database for inventory if write throughput demands it. Use Redis for cart storage.
Trade-offs and Alternatives
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Saga pattern | Choreography | Orchestration | Orchestration for complex flows (5+ steps) |
| Inventory locking | Pessimistic | Optimistic | Optimistic for normal traffic, queue-based for flash sales |
| Cart storage | Redis | Database | Redis with database fallback for persistence |
| Payment retry | Immediate | Exponential backoff | Backoff with idempotency keys |
| Event delivery | At-least-once | Exactly-once | At-least-once with idempotent consumers |
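The "at-least-once with idempotent consumers" recommendation in the table relies on each consumer ignoring events it has already handled. A minimal sketch, assuming every event carries a unique event_id (as in the Order Events table) and using an in-memory set where a production consumer would use a durable store with a unique key:

```python
class IdempotentConsumer:
    """Wrap a handler so redelivered events are processed only once."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # assumption: durable storage in production

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event["event_id"] in self.processed_ids:
            return False
        self.handler(event)
        # Record the id only after the handler succeeds, so a crash
        # mid-handler leads to a retry rather than a lost event.
        self.processed_ids.add(event["event_id"])
        return True


revenue = {"total_cents": 0}


def record_revenue(event: dict) -> None:
    revenue["total_cents"] += event["total_cents"]


consumer = IdempotentConsumer(record_revenue)
confirmed = {"event_id": "evt_1", "event_type": "OrderConfirmed", "total_cents": 5998}
first_delivery = consumer.handle(confirmed)
second_delivery = consumer.handle(confirmed)  # broker redelivers the same event
```

Recording the id after the handler means the handler itself must tolerate a rerun if the process dies between the two steps; wrapping both in one database transaction closes that gap.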
Why not a monolithic transaction? A single distributed transaction (two-phase commit) across Inventory, Payment, and Fulfillment services would lock resources across all three databases simultaneously. This does not scale: a slow payment provider would hold locks on inventory rows, blocking other users from purchasing. The saga pattern releases locks immediately after each local transaction, allowing services to scale independently.
Why not event sourcing for everything? Event sourcing (rebuilding state by replaying events) is powerful for audit trails and debugging, but it adds complexity for simple CRUD operations like cart management. Use event sourcing for the order lifecycle (where the audit trail is valuable) and traditional CRUD for the cart and user preferences.
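Rebuilding an order's current state by replaying its events can be sketched as follows. The event fields (from_state, to_state, created_at) follow the Order Events table; the function name and the continuity check are illustrative assumptions.

```python
def replay_order_state(events: list):
    """Fold an order's events, oldest first, into its current state."""
    ordered = sorted(events, key=lambda e: e["created_at"])
    if not ordered:
        return None
    state = ordered[0]["from_state"]
    for event in ordered:
        if event["from_state"] != state:
            raise ValueError(
                f"gap in event log: expected {state}, got {event['from_state']}"
            )
        state = event["to_state"]
    return state


events = [
    {"from_state": "reserved", "to_state": "confirmed",
     "created_at": "2025-08-05T14:23:05Z"},
    {"from_state": "created", "to_state": "reserved",
     "created_at": "2025-08-05T14:23:01Z"},
    {"from_state": "confirmed", "to_state": "shipped",
     "created_at": "2025-08-06T09:00:00Z"},
]
current_state = replay_order_state(events)
```

Sorting by timestamp works here because the timestamps share one ISO 8601 format. A sequence number per order would be a more robust ordering key if two events can share a timestamp.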
Handling partial failures: If payment fails but the Inventory Service crashes before releasing the reservation, the system is left inconsistent: stock stays held for an order that will never complete. The saga orchestrator must track which steps completed and run compensating transactions. A periodic reconciliation job compares order states across services and flags inconsistencies for manual or automated resolution.
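A reconciliation pass can be sketched as a pure function over snapshots from each service. The input shapes and the three rules below are assumptions chosen for illustration; a real job would read from the orders, payments, and inventory stores and cover more cases.

```python
def find_inconsistencies(orders: dict, payments: dict, reservations: set) -> list:
    """Compare order status with payment status and active stock holds.

    orders:       order_id -> order status
    payments:     order_id -> payment status
    reservations: order_ids that still hold reserved stock
    """
    issues = []
    for order_id, status in sorted(orders.items()):
        paid = payments.get(order_id) == "completed"
        if status == "cancelled" and paid:
            issues.append((order_id, "charged_but_cancelled"))
        if status == "cancelled" and order_id in reservations:
            issues.append((order_id, "stock_still_reserved"))
        if status == "confirmed" and not paid:
            issues.append((order_id, "confirmed_without_payment"))
    return issues


orders = {"o1": "confirmed", "o2": "cancelled", "o3": "cancelled", "o4": "confirmed"}
payments = {"o1": "completed", "o2": "completed", "o3": "failed"}
reservations = {"o3"}
issues = find_inconsistencies(orders, payments, reservations)
```

Each flagged pair maps to a repair action: a refund for charged_but_cancelled, a stock release for stock_still_reserved, and manual review for confirmed_without_payment.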
FAQ
Why use the saga pattern instead of distributed transactions in e-commerce?
Distributed transactions (2PC) lock resources and do not scale. The saga pattern breaks the order flow into local transactions with compensating actions, allowing each service to scale independently while maintaining eventual consistency. In a 2PC approach, the payment provider, inventory database, and order database all hold locks simultaneously. If the payment provider takes 3 seconds to respond, the inventory row is locked for 3 seconds, blocking other purchases of that product. With sagas, the inventory is reserved (local transaction completes and releases the lock immediately), then payment is processed separately. If payment fails, a compensating transaction releases the inventory reservation.
How do you prevent overselling inventory in a high-traffic system?
Use optimistic locking with version checks, reserve inventory with TTL-based holds during checkout, and process orders through a single-partition queue per product to serialize inventory updates without distributed locks. The TTL-based reservation is critical: when a user begins checkout, the system temporarily reserves their items for 10 minutes. If they do not complete the purchase, the reservation expires automatically and the stock becomes available again. During flash sales, the queue-based approach is most effective because it eliminates contention entirely by processing one purchase at a time per product.
How should the system handle payment failures during order processing?
Implement compensating transactions that release reserved inventory and notify the user. Use idempotency keys for payment retries, store payment state in a state machine, and support multiple payment method fallbacks. The idempotency key is the most critical piece: it ensures that if the client retries a failed payment request (because the response was lost due to a network error), the payment provider recognizes the duplicate and returns the original result instead of charging again. The order stays in RESERVED state during payment retry, and the inventory reservation TTL is extended to prevent the reserved items from being released while the payment is being retried.