

January 10, 2026

Last updated: January 10, 2026

Design an E-Commerce Order Processing System

Design a fault-tolerant e-commerce order system with inventory management, payment processing, saga pattern for transactions, and event-driven order fulfillment.


This post applies concepts from the System Design from Zero to Hero series.

TL;DR

An e-commerce order system uses the saga pattern to coordinate inventory, payment, and fulfillment services, with compensating transactions to handle failures gracefully. The order follows a state machine from creation through payment to delivery, with each transition triggering events that downstream services consume. Inventory reservation uses atomic conditional updates (optimistic locking under normal traffic, per-product queue serialization during flash sales) to prevent overselling, and idempotency keys ensure payment retries do not charge customers twice. The entire system is event-driven, with each service owning its own data and communicating through a message broker.

Requirements

Functional Requirements

›Cart management — Users can add/remove items, update quantities, and view their cart.
›Order placement — Convert a cart into an order with address, payment method, and delivery preference.
›Payment processing — Charge the customer and handle payment failures, refunds, and retries.
›Inventory management — Track stock levels, reserve inventory during checkout, and prevent overselling.
›Order tracking — Display real-time order status (confirmed, processing, shipped, delivered).
›Order cancellation — Allow cancellation before shipment, triggering refund and inventory release.

Non-Functional Requirements

›Consistency — Inventory must never go negative. A customer must never be charged without an order being created.
›Idempotency — Payment retries must not result in double charges.
›Availability — The checkout flow must be available 99.99% of the time during peak traffic.
›Scalability — Handle 10,000 orders per second during flash sales.
›Fault tolerance — If any service fails mid-transaction, the system must recover to a consistent state.

Back-of-Envelope Estimation

Assume a large e-commerce platform during a flash sale:

›Peak order rate: 10,000 orders/second
›Average items per order: 2.5 items → 25,000 inventory operations/second
›Payment processing: 10,000 payment requests/second (each taking 1-3 seconds with payment provider)
›Cart reads: 100,000/second (users browsing, much higher than order rate)
›Order storage: ~2 KB per order → 10,000 × 2 KB = 20 MB/second → ~1.7 TB/day if peak traffic were sustained for 24 hours
›Inventory updates: Must serialize per-SKU to prevent overselling → partition by product_id
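To make the arithmetic easy to re-run with different assumptions, here is a small Python sketch. The constant names are ours; the last figure applies Little's law (in-flight requests = arrival rate × latency), which the list implies but does not state: at 1-3 seconds per provider call, 10,000 payments per second means roughly 10,000-30,000 calls open at once, which is what connection pools and timeouts have to be sized for.

```python
# Back-of-envelope numbers from the list above, recomputed.
# Decimal units (1 KB = 1,000 bytes) to match the rounding in the text.
PEAK_ORDERS_PER_SEC = 10_000
ITEMS_PER_ORDER = 2.5
ORDER_SIZE_BYTES = 2_000          # ~2 KB per order
PAYMENT_LATENCY_SEC = (1, 3)      # provider round-trip range
SECONDS_PER_DAY = 86_400


def estimate():
    inventory_ops = PEAK_ORDERS_PER_SEC * ITEMS_PER_ORDER
    write_mb_per_sec = PEAK_ORDERS_PER_SEC * ORDER_SIZE_BYTES / 1e6
    tb_per_day = write_mb_per_sec * SECONDS_PER_DAY / 1e6
    # Little's law: requests in flight = arrival rate x latency
    in_flight = tuple(PEAK_ORDERS_PER_SEC * s for s in PAYMENT_LATENCY_SEC)
    return {
        "inventory_ops_per_sec": inventory_ops,
        "order_write_mb_per_sec": write_mb_per_sec,
        "order_tb_per_day_at_peak": tb_per_day,
        "payment_calls_in_flight": in_flight,
    }


for name, value in estimate().items():
    print(f"{name}: {value}")
```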

High-Level Design

Client → API Gateway → Cart Service
                          ↓
                    Order Service → Event Bus (Kafka)
                      ↙    ↓      ↘
              Payment    Inventory   Fulfillment
              Service    Service     Service
                ↓          ↓           ↓
            Payment     Inventory    Shipping
            Provider    Database     Provider

Order placement flow:

›User clicks "Place Order." The API gateway calls the Order Service.
›The Order Service creates an order in CREATED state and publishes an OrderCreated event.
›The Inventory Service consumes the event, reserves stock, and publishes InventoryReserved.
›The Payment Service consumes InventoryReserved, charges the customer, and publishes PaymentCompleted.
›The Order Service consumes both events, moving the order to RESERVED and then to CONFIRMED.
›The Fulfillment Service picks up the confirmed order for shipping.

If any step fails, compensating transactions undo the previous steps (release inventory, refund payment).

Detailed Design

Order State Machine

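The order follows a well-defined state machine. Every transition is triggered by an event, and only valid transitions are allowed. A declined or timed-out payment does not move the order backward: it stays in RESERVED while the payment is retried, and moves to CANCELLED (releasing its inventory) only once retries are exhausted.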

                    ┌──────────────┐
                    │   CREATED    │──── inventory failed ──→ CANCELLED
                    └──────┬───────┘
                           │ inventory reserved
                    ┌──────▼───────┐
                    │   RESERVED   │──── payment failed (retries exhausted) ──→ CANCELLED
                    └──────┬───────┘     compensation: release inventory
                           │ payment completed
                    ┌──────▼───────┐
                    │  CONFIRMED   │──── cancelled before shipment ──→ CANCELLED
                    └──────┬───────┘     compensation: refund, release inventory
                           │ shipped
                    ┌──────▼───────┐
                    │   SHIPPED    │
                    └──────┬───────┘
                           │ delivered
                    ┌──────▼───────┐
                    │  DELIVERED   │──── refund requested ──→ REFUND_PENDING ──→ REFUNDED
                    └──────────────┘

Each state transition is persisted as an event in an append-only order events table. This gives a full audit trail and enables event sourcing if needed.

python

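from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    RESERVED = "reserved"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"
    REFUND_PENDING = "refund_pending"
    REFUNDED = "refunded"

# CANCELLED and REFUNDED are terminal: they have no outgoing transitions.
VALID_TRANSITIONS = {
    OrderState.CREATED: [OrderState.RESERVED, OrderState.CANCELLED],
    OrderState.RESERVED: [OrderState.CONFIRMED, OrderState.CANCELLED],
    OrderState.CONFIRMED: [OrderState.SHIPPED, OrderState.CANCELLED],
    OrderState.SHIPPED: [OrderState.DELIVERED],
    OrderState.DELIVERED: [OrderState.REFUND_PENDING],
    OrderState.REFUND_PENDING: [OrderState.REFUNDED],
}


class InvalidTransitionError(Exception):
    pass


@dataclass
class OrderEvent:
    order_id: str
    from_state: OrderState
    to_state: OrderState
    event_data: dict
    timestamp: datetime


def now():
    return datetime.now(timezone.utc)


order_events = []  # stand-in for the append-only order events table


def transition_order(order, new_state, event_data):
    if new_state not in VALID_TRANSITIONS.get(order.state, []):
        raise InvalidTransitionError(
            f"Cannot transition from {order.state} to {new_state}"
        )
    previous_state = order.state  # capture before overwriting
    order.state = new_state
    order.updated_at = now()
    # In production, persist the state change and the event atomically
    order_events.append(OrderEvent(
        order_id=order.id,
        from_state=previous_state,
        to_state=new_state,
        event_data=event_data,
        timestamp=now(),
    ))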

Saga Pattern for Distributed Transactions

An order involves multiple services: Inventory, Payment, and Fulfillment. A traditional database transaction cannot span these services. The saga pattern breaks the process into a sequence of local transactions, each with a compensating action that undoes it on failure. For event-driven architecture patterns, see Part 6: Message Queues and Event-Driven Architecture.

Choreography-based saga (event-driven):

Each service listens for events and reacts independently. No central coordinator.

1. Order Service    → publishes OrderCreated
2. Inventory Service → consumes OrderCreated
                     → reserves inventory
                     → publishes InventoryReserved (or InventoryFailed)
3. Payment Service  → consumes InventoryReserved
                     → charges payment
                     → publishes PaymentCompleted (or PaymentFailed)
4. Order Service    → consumes PaymentCompleted
                     → updates order to CONFIRMED
                     → publishes OrderConfirmed
5. Fulfillment      → consumes OrderConfirmed
                     → initiates shipping

Compensating transactions (on failure):

PaymentFailed event:
  → Inventory Service releases reserved stock
  → Order Service marks order as CANCELLED
  → Notification Service sends failure email to user

InventoryFailed event:
  → Order Service marks order as CANCELLED
  → Notification Service sends "out of stock" email
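A minimal in-process sketch of the choreography above, including the compensation path. The `EventBus` class is a hypothetical synchronous stand-in for Kafka (no partitions, retries, or persistence), and the handler names are illustrative; the point is that each service only subscribes to events and publishes new ones, with no coordinator deciding what happens next. Notification handling is omitted.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-memory stand-in for Kafka: publish() calls subscribers synchronously."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every published event type, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)


def wire_services(bus, stock, payment_ok):
    """Each service reacts to events; nobody coordinates the flow centrally."""
    orders = {}

    def inventory_on_order_created(e):
        if stock.get(e["sku"], 0) >= e["qty"]:
            stock[e["sku"]] -= e["qty"]
            bus.publish("InventoryReserved", e)
        else:
            bus.publish("InventoryFailed", e)

    def payment_on_inventory_reserved(e):
        bus.publish("PaymentCompleted" if payment_ok(e) else "PaymentFailed", e)

    def inventory_on_payment_failed(e):  # compensating transaction
        stock[e["sku"]] += e["qty"]

    def order_on_payment_completed(e):
        orders[e["order_id"]] = "CONFIRMED"
        bus.publish("OrderConfirmed", e)

    def order_on_failure(e):
        orders[e["order_id"]] = "CANCELLED"

    bus.subscribe("OrderCreated", inventory_on_order_created)
    bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
    bus.subscribe("PaymentFailed", inventory_on_payment_failed)
    bus.subscribe("PaymentFailed", order_on_failure)
    bus.subscribe("InventoryFailed", order_on_failure)
    bus.subscribe("PaymentCompleted", order_on_payment_completed)
    return orders


# Usage: the payment is declined, so the reserved unit is put back
bus, stock = EventBus(), {"SKU-123": 1}
orders = wire_services(bus, stock, payment_ok=lambda e: False)
bus.publish("OrderCreated", {"order_id": "ord_1", "sku": "SKU-123", "qty": 1})
print(orders, stock)  # {'ord_1': 'CANCELLED'} {'SKU-123': 1}
```

Because the handlers call each other synchronously here, one `publish` runs the whole saga; with a real broker each hop would be a separate, asynchronously delivered message.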

Orchestration-based saga (alternative):

A central Order Saga Orchestrator coordinates the steps explicitly:

python

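class OrderSagaOrchestrator:
    def __init__(self, inventory_service, payment_service, event_bus):
        # Collaborators are injected; their method names are illustrative
        self.inventory_service = inventory_service
        self.payment_service = payment_service
        self.event_bus = event_bus

    def execute(self, order):
        completed_steps = []  # steps that succeeded, in execution order
        try:
            # Step 1: Reserve inventory
            inventory_result = self.inventory_service.reserve(
                order.items, order.id
            )
            if not inventory_result.success:
                self.cancel_order(order, "Inventory unavailable")
                return
            completed_steps.append("inventory_reserved")
            order.transition(OrderState.RESERVED)

            # Step 2: Process payment (same key on every retry)
            payment_result = self.payment_service.charge(
                order.user_id, order.total,
                idempotency_key=f"order-{order.id}"
            )
            if not payment_result.success:
                # Compensate: release inventory
                self.compensate(order, completed_steps)
                self.cancel_order(order, "Payment failed")
                return
            completed_steps.append("payment_charged")

            # Step 3: Confirm order
            order.transition(OrderState.CONFIRMED)
            self.event_bus.publish("OrderConfirmed", order)

        except Exception:
            # Compensate all completed steps, then surface the error
            self.compensate(order, completed_steps)
            self.cancel_order(order, "Saga failed")
            raise

    def compensate(self, order, completed_steps):
        # Undo in reverse order; popping ensures no step is undone twice
        while completed_steps:
            step = completed_steps.pop()
            if step == "payment_charged":
                self.payment_service.refund(order.id)  # hypothetical method
            elif step == "inventory_reserved":
                self.inventory_service.release(order.items, order.id)

    def cancel_order(self, order, reason):
        order.transition(OrderState.CANCELLED)
        self.event_bus.publish("OrderCancelled",
                               {"order_id": order.id, "reason": reason})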

Choreography vs Orchestration:

›Choreography is simpler for small sagas (3-4 steps) but becomes hard to trace and debug as complexity grows.
›Orchestration centralizes the flow logic, making it easier to understand, test, and monitor. Preferred for complex multi-step flows.

Inventory Reservation: Pessimistic vs Optimistic Locking

The core inventory challenge is preventing overselling: two users trying to buy the last item must not both succeed. For database-level locking strategies, see Part 7: Sharding and Partitioning.

Pessimistic locking (SELECT FOR UPDATE):

Lock the inventory row when reading it. Other transactions must wait until the lock is released.

sql

BEGIN;
SELECT available FROM inventory WHERE product_id = 'SKU-123' FOR UPDATE;
-- Application checks available >= requested quantity (1 here);
-- if the check fails, ROLLBACK instead of updating.
UPDATE inventory
SET available = available - 1,
    reserved = reserved + 1
WHERE product_id = 'SKU-123';
COMMIT;

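Pros: Guarantees no overselling. Cons: Lock contention. During a flash sale, thousands of concurrent requests for the same product queue up behind a single row lock, so throughput for that SKU is capped at roughly one purchase per lock-hold time (for example, a 5 ms transaction allows about 200 purchases per second), no matter how many application servers are added.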

Optimistic locking (version-based):

Read the inventory row with its version number. On update, check that the version has not changed. If it has, retry.

sql

-- Read
SELECT available, version FROM inventory WHERE product_id = 'SKU-123';
-- available = 10, version = 42
 
-- Update with version check
UPDATE inventory
SET available = available - 1, reserved = reserved + 1, version = version + 1
WHERE product_id = 'SKU-123' AND version = 42;
-- If rows_affected = 0, another writer changed the row → re-read and retry

Pros: No locks held. Higher throughput under moderate contention. Cons: Under high contention (flash sales), most concurrent attempts read a version that is already stale, fail the version check, and retry, wasting database round trips. Works well when conflicts are rare.

Recommended approach for flash sales:

Route purchase requests through a message queue partitioned by product ID. All requests for a given SKU land in the same partition (many SKUs may share one partition) and are processed serially. This removes lock contention on the inventory row while preserving arrival order for each SKU.

Purchase requests for SKU-123 → Kafka partition (key=SKU-123) → Single consumer

The consumer processes one request at a time: check stock, decrement, confirm. Within a consumer group, each partition is assigned to exactly one consumer, so there is no concurrent access to a SKU's stock and no locking is needed. The trade-off is that a hot SKU's throughput is bounded by what one consumer can process.
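The queue-based approach can be simulated in a few lines of Python. This sketch assumes an in-process list of `deque`s in place of Kafka partitions and uses `zlib.crc32` as a stand-in for Kafka's default key partitioner (which hashes keys with murmur2); what matters is that the mapping is stable, so every request for one SKU lands in one partition and is consumed in order.

```python
import zlib
from collections import deque

NUM_PARTITIONS = 4  # illustrative; a real topic is sized for throughput


def partition_for(sku: str) -> int:
    # Stable hash: the same SKU always maps to the same partition
    return zlib.crc32(sku.encode()) % NUM_PARTITIONS


def consume_partition(queue: deque, stock: dict) -> list:
    """One consumer drains one partition in order: check, decrement, confirm."""
    results = []
    while queue:
        order_id, sku, qty = queue.popleft()
        if stock.get(sku, 0) >= qty:
            stock[sku] -= qty
            results.append((order_id, "CONFIRMED"))
        else:
            results.append((order_id, "OUT_OF_STOCK"))
    return results


partitions = [deque() for _ in range(NUM_PARTITIONS)]
stock = {"SKU-123": 3}
for i in range(5):  # five buyers race for three units
    partitions[partition_for("SKU-123")].append((f"ord_{i}", "SKU-123", 1))

outcomes = [r for p in partitions for r in consume_partition(p, stock)]
print(outcomes, stock)
```

Five buyers compete for three units; the first three in partition order are confirmed and the rest are rejected, without any lock on the stock counter.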

Reservation with TTL:

When a user begins checkout, the system does not deduct stock permanently. Instead, it moves the requested units from available to reserved under a time-limited reservation (e.g., 10 minutes). If the user does not complete checkout within the TTL, the reservation expires and the stock is released back. This prevents abandoned checkouts from permanently reducing available inventory.

python

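from datetime import datetime, timedelta, timezone

RESERVATION_TTL = timedelta(minutes=10)


class InsufficientStockError(Exception):
    pass


def reserve_inventory(product_id: str, quantity: int, order_id: str):
    # db, scheduler, and release_reservation are assumed interfaces
    reservation = {
        "product_id": product_id,
        "quantity": quantity,
        "order_id": order_id,
        "expires_at": datetime.now(timezone.utc) + RESERVATION_TTL
    }
    # Atomic conditional update: decrement available, increment reserved,
    # but only if enough stock remains (no read-then-write race)
    result = db.execute("""
        UPDATE inventory
        SET available = available - %s, reserved = reserved + %s
        WHERE product_id = %s AND available >= %s
    """, (quantity, quantity, product_id, quantity))

    if result.rows_affected == 0:
        raise InsufficientStockError(product_id)

    # Persist the hold (hypothetical reservations table) so the expiry job
    # knows what to return; ideally in the same transaction as the UPDATE
    db.insert("reservations", reservation)

    # Schedule TTL expiration; releasing must be a no-op if the order
    # was paid before the hold expired
    scheduler.schedule_at(reservation["expires_at"],
                          release_reservation, order_id)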
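The function above only covers the reserve side. The following sketch, again using in-memory `sqlite3` and assumed table and function names (`reservations`, `confirm`, `release_expired`), also shows the two ways a hold can end: payment turns it into a sale, or a periodic sweeper returns it to `available` once `expires_at` has passed. Timestamps are plain seconds to keep the example deterministic. The `settled = 0` guard makes confirmation and expiry mutually exclusive, so a hold can never be both sold and released.

```python
import sqlite3

TTL_SECONDS = 600  # 10-minute hold, as in the text

SCHEMA = """
CREATE TABLE inventory (product_id TEXT PRIMARY KEY, available INT, reserved INT);
CREATE TABLE reservations (order_id TEXT, product_id TEXT, quantity INT,
                           expires_at REAL, settled INT DEFAULT 0);
"""


def reserve(conn, order_id, sku, qty, now):
    with conn:  # one transaction: move stock and record the hold together
        cur = conn.execute(
            "UPDATE inventory SET available = available - ?, reserved = reserved + ? "
            "WHERE product_id = ? AND available >= ?", (qty, qty, sku, qty))
        if cur.rowcount == 0:
            return False  # not enough available stock
        conn.execute(
            "INSERT INTO reservations (order_id, product_id, quantity, expires_at) "
            "VALUES (?, ?, ?, ?)", (order_id, sku, qty, now + TTL_SECONDS))
    return True


def _settle(conn, rowid):
    # Conditional update: only one of confirm/expire can settle a hold
    cur = conn.execute(
        "UPDATE reservations SET settled = 1 WHERE rowid = ? AND settled = 0", (rowid,))
    return cur.rowcount == 1


def confirm(conn, order_id):
    """Payment succeeded: the held units become sold and leave `reserved`."""
    with conn:
        for rowid, sku, qty in conn.execute(
                "SELECT rowid, product_id, quantity FROM reservations "
                "WHERE order_id = ? AND settled = 0", (order_id,)).fetchall():
            if _settle(conn, rowid):
                conn.execute("UPDATE inventory SET reserved = reserved - ? "
                             "WHERE product_id = ?", (qty, sku))


def release_expired(conn, now):
    """Sweeper: give stock back for holds whose TTL has passed."""
    released = 0
    with conn:
        for rowid, sku, qty in conn.execute(
                "SELECT rowid, product_id, quantity FROM reservations "
                "WHERE settled = 0 AND expires_at <= ?", (now,)).fetchall():
            if _settle(conn, rowid):
                conn.execute("UPDATE inventory SET available = available + ?, "
                             "reserved = reserved - ? WHERE product_id = ?", (qty, qty, sku))
                released += 1
    return released


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO inventory VALUES ('SKU-123', 2, 0)")
reserve(conn, "ord_1", "SKU-123", 1, now=0)  # checkout abandoned
reserve(conn, "ord_2", "SKU-123", 1, now=0)  # checkout completed
confirm(conn, "ord_2")
released = release_expired(conn, now=TTL_SECONDS)
print(released, conn.execute("SELECT available, reserved FROM inventory").fetchone())  # 1 (1, 0)
```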

Idempotency Keys for Payments

Payment processing must be idempotent. If the client retries a payment request (due to a timeout or network error), the payment provider must not charge the customer again. The solution is an idempotency key: a unique identifier sent with each payment request.

python

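from datetime import datetime, timezone
from decimal import Decimal


def now():
    return datetime.now(timezone.utc)


def process_payment(order_id: str, amount: Decimal, payment_method: str):
    # db, payment_provider, and PaymentDeclinedError are assumed interfaces
    idempotency_key = f"payment-{order_id}"

    # Check if this payment was already processed
    payment = db.query(
        "SELECT * FROM payments WHERE idempotency_key = %s",
        idempotency_key
    )
    if payment and payment.status != "PENDING":
        return payment  # Return the previous final result

    if not payment:
        # Create payment record before calling provider. A UNIQUE constraint
        # on idempotency_key makes a concurrent duplicate insert fail instead
        # of creating a second payment record.
        payment = db.insert("payments", {
            "idempotency_key": idempotency_key,
            "order_id": order_id,
            "amount": amount,
            "status": "PENDING",
            "created_at": now()
        })

    # A PENDING record may mean an earlier attempt reached the provider but
    # its response was lost; calling again with the same key is safe.
    try:
        result = payment_provider.charge(
            amount=amount,
            payment_method=payment_method,
            idempotency_key=idempotency_key  # Provider also deduplicates
        )
        payment.update(status="COMPLETED", provider_id=result.id,
                       completed_at=now())
        return payment
    except PaymentDeclinedError:
        payment.update(status="FAILED")
        raise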

Two layers of idempotency:

›Application layer: The Payment Service checks its own payments table before calling the payment provider.
›Provider layer: Stripe, PayPal, and other providers accept an idempotency_key parameter and deduplicate on their end.

Both layers are necessary because network failures can cause the application to miss the provider's response. The provider might have charged successfully, but the application never received the confirmation. On retry, the provider's idempotency check prevents a double charge.
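The lost-response case can be reproduced end to end with a fake provider. In this sketch, `FakeProvider` is hypothetical and only imitates the deduplication behavior described above; the `payments` table is reduced to three columns, and amounts are integer cents to avoid floating-point money. The first attempt charges the card but raises before the application sees the result, leaving a `PENDING` row; the retry reuses the same key, so the provider returns the original charge instead of creating a new one.

```python
import sqlite3


class FakeProvider:
    """Hypothetical provider that deduplicates charges by idempotency key."""

    def __init__(self):
        self.charges = {}              # idempotency_key -> charge id
        self.drop_next_response = False

    def charge(self, amount_cents, idempotency_key):
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = f"ch_{len(self.charges) + 1}"
        if self.drop_next_response:    # simulate: charged, but reply lost
            self.drop_next_response = False
            raise TimeoutError("response lost")
        return self.charges[idempotency_key]


def process_payment(conn, provider, order_id, amount_cents):
    key = f"payment-{order_id}"
    row = conn.execute("SELECT status, provider_id FROM payments "
                       "WHERE idempotency_key = ?", (key,)).fetchone()
    if row and row[0] == "COMPLETED":
        return row[1]  # application-layer dedup: no provider call at all
    # UNIQUE key + INSERT OR IGNORE: a duplicate attempt cannot add a 2nd row
    conn.execute("INSERT OR IGNORE INTO payments (idempotency_key, status) "
                 "VALUES (?, 'PENDING')", (key,))
    conn.commit()
    charge_id = provider.charge(amount_cents, idempotency_key=key)
    conn.execute("UPDATE payments SET status = 'COMPLETED', provider_id = ? "
                 "WHERE idempotency_key = ?", (charge_id, key))
    conn.commit()
    return charge_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (idempotency_key TEXT UNIQUE, "
             "status TEXT, provider_id TEXT)")
provider = FakeProvider()
provider.drop_next_response = True
try:
    process_payment(conn, provider, "ord_1", 5998)  # charged, reply lost
except TimeoutError:
    pass
first = process_payment(conn, provider, "ord_1", 5998)   # client retries
second = process_payment(conn, provider, "ord_1", 5998)  # and again
print(first, second, len(provider.charges))  # ch_1 ch_1 1
```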

Event-Driven Order Updates

Every state change in the order lifecycle is published as an event to Kafka. Downstream services consume events relevant to them:

json

{
  "event_type": "OrderConfirmed",
  "order_id": "ord_abc123",
  "user_id": "usr_456",
  "items": [
    {"product_id": "SKU-123", "quantity": 2, "price": 29.99}
  ],
  "total": 59.98,
  "timestamp": "2025-08-05T14:23:00Z"
}

Event consumers:

›Fulfillment Service — Listens for OrderConfirmed to initiate picking and packing.
›Notification Service — Listens for all order events to send status update emails/push.
›Analytics Service — Listens for OrderConfirmed and OrderCancelled for revenue tracking.
›Search/Recommendation Service — Listens for order events to update purchase history.

Event ordering guarantee: Use Kafka with order_id as the partition key. All events for a given order go to the same partition and are processed in order. This ensures a consumer never processes OrderShipped before OrderConfirmed.

Cart Service Design

The cart is a high-read, high-write, ephemeral data structure. It does not need the same durability guarantees as orders.

Storage options:

›Redis (recommended for logged-in users): Store the cart as a Redis hash. Fast reads and writes. Set a TTL of 30 days for cart expiration. If Redis loses data (for example, a node fails without persistence or replication), carts are lost, which is acceptable for most e-commerce sites.
›Database (for persistence): Store cart data in a database for users who expect their cart to persist across devices and sessions. Use Redis as a write-through cache in front of the database.
›Client-side (for anonymous users): Store the cart in a cookie or local storage. Merge with the server-side cart upon login.

Redis Hash — cart:{user_id}
Fields:
  SKU-123: {"quantity": 2, "price": 29.99, "added_at": "..."}
  SKU-456: {"quantity": 1, "price": 49.99, "added_at": "..."}
TTL: 30 days
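The merge-on-login step for anonymous carts is not specified above, so this is one possible policy, sketched in Python under explicit assumptions: quantities for a SKU present in both carts are summed and capped at a hypothetical `max_qty`, and the server-side entry's other fields win. Prices carried in either cart are not trusted, since they are re-validated at checkout. Writing the result back would be an `HSET` of the merged fields on `cart:{user_id}` followed by resetting the 30-day TTL.

```python
def merge_carts(server_cart: dict, guest_cart: dict, max_qty: int = 10) -> dict:
    """Merge a guest (client-side) cart into a logged-in user's cart.

    Assumed policy: quantities of the same SKU are summed and capped at
    max_qty; other fields come from the server-side entry when both exist.
    Both carts map SKU -> {"quantity": int, ...} like the Redis hash fields.
    """
    merged = {sku: dict(item) for sku, item in server_cart.items()}  # copy, don't mutate
    for sku, item in guest_cart.items():
        current = merged.get(sku, {**item, "quantity": 0})
        total = min(current["quantity"] + item["quantity"], max_qty)
        merged[sku] = {**current, "quantity": total}
    return merged


server = {"SKU-123": {"quantity": 2, "price": 29.99}}
guest = {"SKU-123": {"quantity": 1, "price": 29.99},
         "SKU-456": {"quantity": 1, "price": 49.99}}
merged = merge_carts(server, guest)
print({sku: item["quantity"] for sku, item in merged.items()})  # {'SKU-123': 3, 'SKU-456': 1}
```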

Cart-to-order conversion:

When the user clicks "Place Order," the Cart Service reads the cart, validates prices against the current catalog (prices may have changed since the item was added), validates inventory availability, and passes the validated cart to the Order Service. The Order Service creates the order and deletes the cart.

Price consistency: The price shown in the cart may differ from the current price at checkout time. The system should re-validate prices at order placement and display any changes to the user before confirming.
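A sketch of the validation step at order placement, assuming the cart and catalog are plain dicts of `Decimal` prices (the function and parameter names are hypothetical). It returns the line items to hand to the Order Service, priced at the current catalog price, plus a list of changes to show the user before confirming. The stock check here is only advisory; the authoritative check is the atomic reservation in the Inventory Service.

```python
from decimal import Decimal


def revalidate_cart(cart: dict, catalog: dict, stock: dict):
    """Compare cart snapshot prices against the live catalog before ordering.

    cart:    SKU -> {"quantity": int, "price": Decimal}  (price when added)
    catalog: SKU -> current Decimal price;  stock: SKU -> units available
    Returns (validated line items, list of changes to show the user).
    """
    lines, changes = [], []
    for sku, item in cart.items():
        current = catalog.get(sku)
        if current is None:
            changes.append((sku, "no longer sold"))
            continue
        if stock.get(sku, 0) < item["quantity"]:
            changes.append((sku, "insufficient stock"))
            continue
        if current != item["price"]:
            changes.append((sku, f"price changed {item['price']} -> {current}"))
        lines.append({"product_id": sku, "quantity": item["quantity"],
                      "unit_price": current,
                      "subtotal": current * item["quantity"]})
    return lines, changes


cart = {"SKU-123": {"quantity": 2, "price": Decimal("29.99")},
        "SKU-456": {"quantity": 1, "price": Decimal("49.99")}}
catalog = {"SKU-123": Decimal("27.99"), "SKU-456": Decimal("49.99")}
lines, changes = revalidate_cart(cart, catalog, stock={"SKU-123": 5, "SKU-456": 0})
print(changes)
```

The line-item fields mirror the Order Items table, so `unit_price` records the price at the time of purchase rather than the price when the item was added to the cart.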

Data Model

Orders Table

Column	Type	Description
order_id	UUID	Primary key
user_id	UUID	Customer who placed the order
status	VARCHAR	created/reserved/confirmed/shipped/delivered/cancelled/refund_pending/refunded
total_amount	DECIMAL	Order total
currency	VARCHAR(3)	Currency code (USD, EUR)
shipping_address	JSONB	Delivery address
payment_method	VARCHAR	Payment method identifier
created_at	TIMESTAMP	Order creation time
updated_at	TIMESTAMP	Last status change

Order Items Table

Column	Type	Description
order_id	UUID	Foreign key to orders
product_id	VARCHAR	SKU identifier
quantity	INT	Number of units
unit_price	DECIMAL	Price at time of purchase
subtotal	DECIMAL	quantity * unit_price

Order Events Table (Append-Only Audit Log)

Column	Type	Description
event_id	UUID	Primary key
order_id	UUID	Foreign key to orders
event_type	VARCHAR	OrderCreated, PaymentCompleted, etc.
from_state	VARCHAR	Previous state
to_state	VARCHAR	New state
event_data	JSONB	Additional event metadata
created_at	TIMESTAMP	Event timestamp

Inventory Table

Column	Type	Description
product_id	VARCHAR	Primary key (SKU)
available	INT	Available for purchase
reserved	INT	Reserved by pending orders
warehouse_id	VARCHAR	Physical location
version	INT	Optimistic lock version
updated_at	TIMESTAMP	Last update time

Payments Table

Column	Type	Description
payment_id	UUID	Primary key
order_id	UUID	Foreign key to orders
idempotency_key	VARCHAR	Unique key for deduplication
amount	DECIMAL	Charged amount
currency	VARCHAR(3)	Currency code
status	VARCHAR	pending/completed/failed/refunded
provider_id	VARCHAR	External provider transaction ID
created_at	TIMESTAMP	Payment initiation time
completed_at	TIMESTAMP	Payment completion time

Scaling Considerations

Order Service scaling: The Order Service is stateless and scales horizontally. Orders are partitioned by order_id in the database. For sharding strategies, see Part 7: Sharding and Partitioning.

Inventory Service — the hot spot problem: During flash sales, a single popular product receives thousands of concurrent purchase requests. Solutions:

›Queue-based serialization: Route all requests for a product to a single Kafka partition. Process them sequentially.
›Inventory sharding: Split inventory for a hot product across multiple "virtual inventory pools." Each pool handles a fraction of the total stock. Requests are distributed across pools.
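›Pre-deduct with reconciliation: Deduct inventory optimistically and reconcile asynchronously. This relaxes the "inventory must never go negative" requirement: slight overselling is possible and is handled by backorders, so it only suits products where backorders are acceptable.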
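Virtual inventory pools can be sketched as follows. The class is hypothetical: in a database each pool would be its own row (or its own key in Redis), updated with the same conditional decrement used elsewhere in this post, so concurrent buyers contend on one pool instead of one row. A random starting pool spreads the load, and a buyer falls through to the other pools when one runs dry.

```python
import random


class PooledInventory:
    """Split one hot SKU's stock across N independent pools."""

    def __init__(self, total: int, num_pools: int, rng=None):
        base, extra = divmod(total, num_pools)
        # Spread the remainder so pool sizes differ by at most one unit
        self.pools = [base + (1 if i < extra else 0) for i in range(num_pools)]
        self.rng = rng or random.Random()

    def reserve(self, qty: int = 1) -> bool:
        start = self.rng.randrange(len(self.pools))  # random pool spreads load
        for offset in range(len(self.pools)):
            i = (start + offset) % len(self.pools)
            # In a DB: UPDATE ... WHERE pool_id = i AND available >= qty
            if self.pools[i] >= qty:
                self.pools[i] -= qty
                return True
        return False  # no single pool can satisfy this quantity

    def remaining(self) -> int:
        return sum(self.pools)


inv = PooledInventory(total=10, num_pools=4, rng=random.Random(7))
sold = sum(inv.reserve() for _ in range(15))  # 15 buyers, 10 units
print(inv.pools, sold, inv.remaining())  # [0, 0, 0, 0] 10 0
```

One cost of pooling is fragmentation: near sell-out, a multi-unit request can fail even though enough total stock remains, because no single pool holds it. Merging or rebalancing pools when stock runs low is one way to mitigate that.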

Payment Service: Payment provider APIs are the bottleneck (typically 1-3 seconds per call). Use connection pooling, circuit breakers (to avoid hammering a failing provider), and multiple payment providers for failover.
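A minimal circuit breaker around provider calls might look like this Python sketch. The thresholds and the injectable `clock` are assumptions for illustration; production systems usually rely on an existing resilience library instead. When the breaker is open, the Payment Service fails fast and can route to a secondary provider rather than piling more requests onto one that is already timing out.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    """CLOSED: calls pass through and consecutive failures are counted.
    OPEN: calls fail fast for `reset_timeout` seconds.
    HALF_OPEN: one trial call; success closes the circuit, failure re-opens it.
    """

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("provider circuit open; fail over or retry later")
            self.state = "HALF_OPEN"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = self.clock()
            raise
        self.state, self.failures = "CLOSED", 0
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: now[0])


def flaky_charge():
    raise TimeoutError("provider timeout")


for _ in range(2):
    try:
        breaker.call(flaky_charge)
    except TimeoutError:
        pass
print(breaker.state)  # OPEN
```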

Event bus (Kafka) partitioning: Partition the order events topic by order_id to guarantee per-order event ordering. The inventory events topic should be partitioned by product_id for per-product ordering.

Database choices: Use PostgreSQL for orders and payments (ACID transactions needed). Use a separate database for inventory if write throughput demands it. Use Redis for cart storage.

Trade-offs and Alternatives

Decision	Option A	Option B	Recommendation
Saga pattern	Choreography	Orchestration	Orchestration for complex flows (5+ steps)
Inventory locking	Pessimistic	Optimistic	Optimistic for normal traffic, queue-based for flash sales
Cart storage	Redis	Database	Redis with database fallback for persistence
Payment retry	Immediate	Exponential backoff	Backoff with idempotency keys
Event delivery	At-least-once	Exactly-once	At-least-once with idempotent consumers
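Choosing at-least-once delivery means consumers will occasionally see the same event twice, for example when a consumer crashes after processing a message but before committing its offset. A common remedy is to deduplicate on a unique event ID. The sketch below assumes each event carries an `event_id` (as in the order events table; the JSON example above does not show one) and keeps processed IDs in memory purely for illustration.

```python
class IdempotentConsumer:
    """Skip redelivered events by remembering which event_ids were applied.

    processed_ids stands in for a durable store (e.g. a table with a UNIQUE
    event_id column). It should be written in the same transaction as the
    handler's own effects; otherwise a crash between the two re-runs the
    handler on redelivery.
    """

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()

    def on_message(self, event: dict) -> bool:
        if event["event_id"] in self.processed_ids:
            return False  # duplicate delivery: already applied, skip side effects
        self.handler(event)
        self.processed_ids.add(event["event_id"])
        return True


emails_sent = []
consumer = IdempotentConsumer(lambda e: emails_sent.append(e["order_id"]))
event = {"event_id": "evt_1", "event_type": "OrderConfirmed", "order_id": "ord_abc123"}
for delivery in (event, event):  # broker redelivers after a missed commit
    consumer.on_message(delivery)
print(emails_sent)  # ['ord_abc123']
```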

Why not a single distributed transaction? Two-phase commit (2PC) across the Inventory, Payment, and Fulfillment services would lock resources in all three databases simultaneously. This does not scale: a slow payment provider would hold locks on inventory rows, blocking other users from purchasing. In practice, external payment APIs do not expose a prepare/commit protocol, so they cannot participate in 2PC at all. The saga pattern releases locks immediately after each local transaction, allowing services to scale independently.

Why not event sourcing for everything? Event sourcing (rebuilding state by replaying events) is powerful for audit trails and debugging, but it adds complexity for simple CRUD operations like cart management. Use event sourcing for the order lifecycle (where the audit trail is valuable) and traditional CRUD for the cart and user preferences.

Handling partial failures: If the payment succeeds but the orchestrator crashes before confirming the order, or a compensating step (such as releasing inventory) fails, the services disagree about the order's state. The saga orchestrator must persist which steps completed so that, after a restart, it can resume or run compensating transactions. A periodic reconciliation job compares order states across services and flags inconsistencies for manual or automated resolution.
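A reconciliation pass can be as simple as joining each service's view of an order and flagging combinations that a completed saga can never produce. This is a sketch with hypothetical inputs (one dict per service, statuses in the lowercase values used by the data model). A real job would read from each service's store or a replica and would skip orders updated within a grace period, since in-flight sagas legitimately pass through these intermediate states.

```python
def reconcile(orders: dict, payments: dict, reservations: dict) -> list:
    """Cross-check per-order state held by different services.

    orders:       order_id -> order status (Order Service)
    payments:     order_id -> payment status (Payment Service)
    reservations: order_id -> True if stock is still held (Inventory Service)
    Returns (order_id, problem) pairs for automated or manual repair.
    """
    issues = []
    for order_id, status in orders.items():
        paid = payments.get(order_id) == "completed"
        held = reservations.get(order_id, False)
        if paid and status in ("created", "reserved"):
            issues.append((order_id, "charged but not confirmed"))
        if paid and status == "cancelled":
            issues.append((order_id, "cancelled but not refunded"))
        if held and status == "cancelled":
            issues.append((order_id, "cancelled but stock still reserved"))
    return issues


print(reconcile(
    orders={"o1": "reserved", "o2": "cancelled", "o3": "confirmed"},
    payments={"o1": "completed", "o2": "failed", "o3": "completed"},
    reservations={"o2": True},
))
```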

FAQ

Why use the saga pattern instead of distributed transactions in e-commerce?

Distributed transactions (2PC) lock resources and do not scale. The saga pattern breaks the order flow into local transactions with compensating actions, allowing each service to scale independently while maintaining eventual consistency. In a 2PC approach, the payment provider, inventory database, and order database all hold locks simultaneously. If the payment provider takes 3 seconds to respond, the inventory row is locked for 3 seconds, blocking other purchases of that product. With sagas, the inventory is reserved (local transaction completes and releases the lock immediately), then payment is processed separately. If payment fails, a compensating transaction releases the inventory reservation.

How do you prevent overselling inventory in a high-traffic system?

Use optimistic locking with version checks, reserve inventory with TTL-based holds during checkout, and during flash sales route purchases through a queue partitioned by product ID so that each SKU's inventory updates are serialized without distributed locks. The TTL-based reservation is critical: when a user begins checkout, the system temporarily reserves their items for 10 minutes. If they do not complete the purchase, the reservation expires automatically and the stock becomes available again. During flash sales, the queue-based approach is most effective because it eliminates lock contention by processing one purchase at a time per product.

How should the system handle payment failures during order processing?

Implement compensating transactions that release reserved inventory and notify the user. Use idempotency keys for payment retries, store payment state in a state machine, and support multiple payment method fallbacks. The idempotency key is the most critical piece: it ensures that if the client retries a failed payment request (because the response was lost due to a network error), the payment provider recognizes the duplicate and returns the original result instead of charging again. The order stays in RESERVED state during payment retry, and the inventory reservation TTL is extended to prevent the reserved items from being released while the payment is being retried.
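Retrying with exponential backoff, as recommended in the trade-offs table, might look like the sketch below. `TransientPaymentError` is a hypothetical exception for retryable failures such as timeouts; declines should not be retried. The idempotency key is derived once from the order ID and reused on every attempt, and the delay uses "full jitter" (a random value between zero and the capped exponential delay) so that many clients retrying at once do not hit the provider in lockstep. `sleep` and `rng` are injectable so the example runs instantly.

```python
import random
import time


class TransientPaymentError(Exception):
    """Hypothetical: a retryable failure such as a timeout or 5xx."""


def charge_with_retry(charge, order_id, max_attempts=4, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep, rng=random.random):
    """Retry a charge with capped exponential backoff and full jitter."""
    idempotency_key = f"payment-{order_id}"  # same key on every attempt
    for attempt in range(max_attempts):
        try:
            return charge(idempotency_key=idempotency_key)
        except TransientPaymentError:
            if attempt == max_attempts - 1:
                raise  # give up: the saga cancels the order and releases stock
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # full jitter: random delay in [0, cap)


calls = []


def flaky_charge(idempotency_key):
    calls.append(idempotency_key)
    if len(calls) < 3:
        raise TransientPaymentError("timeout")
    return "ch_1"


delays = []
# rng=lambda: 1.0 is a deterministic stub that always sleeps the full cap
result = charge_with_retry(flaky_charge, "ord_abc123",
                           sleep=delays.append, rng=lambda: 1.0)
print(result, calls, delays)
```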


Article Author

Sadam Hussain

Senior Full Stack Developer

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
