Design an E-Commerce Order Processing System

January 10, 2026 · 12 min read

Design a fault-tolerant e-commerce order system with inventory management, payment processing, the saga pattern for distributed transactions, and event-driven order fulfillment.

Tags: System Design, E-Commerce, Saga Pattern, Payments

This post applies concepts from the System Design from Zero to Hero series.

TL;DR

An e-commerce order system uses the saga pattern to coordinate inventory, payment, and fulfillment services, with compensating transactions to handle failures gracefully. The order follows a state machine from cart through payment to delivery, with each transition triggering events that downstream services consume. Inventory reservation uses optimistic locking to prevent overselling, and idempotency keys ensure payment retries do not charge customers twice. The entire system is event-driven, with each service owning its own data and communicating through a message broker.

Requirements

Functional Requirements

  1. Cart management — Users can add/remove items, update quantities, and view their cart.
  2. Order placement — Convert a cart into an order with address, payment method, and delivery preference.
  3. Payment processing — Charge the customer and handle payment failures, refunds, and retries.
  4. Inventory management — Track stock levels, reserve inventory during checkout, and prevent overselling.
  5. Order tracking — Display real-time order status (confirmed, processing, shipped, delivered).
  6. Order cancellation — Allow cancellation before shipment, triggering refund and inventory release.

Non-Functional Requirements

  1. Consistency — Inventory must never go negative. A customer must never be charged without an order being created.
  2. Idempotency — Payment retries must not result in double charges.
  3. Availability — The checkout flow must be available 99.99% of the time during peak traffic.
  4. Scalability — Handle 10,000 orders per second during flash sales.
  5. Fault tolerance — If any service fails mid-transaction, the system must recover to a consistent state.

Back-of-Envelope Estimation

Assume a large e-commerce platform during a flash sale:

  • Peak order rate: 10,000 orders/second
  • Average items per order: 2.5 items → 25,000 inventory operations/second
  • Payment processing: 10,000 payment requests/second (each taking 1-3 seconds with payment provider)
  • Cart reads: 100,000/second (users browsing, much higher than order rate)
  • Order storage: ~2 KB per order → 10,000 * 2KB = 20 MB/second → ~1.7 TB/day during peak
  • Inventory updates: Must serialize per-SKU to prevent overselling → partition by product_id

High-Level Design

Client → API Gateway → Cart Service
                          ↓
                    Order Service → Event Bus (Kafka)
                      ↙    ↓      ↘
              Payment    Inventory   Fulfillment
              Service    Service     Service
                ↓          ↓           ↓
            Payment     Inventory    Shipping
            Provider    Database     Provider

Order placement flow:

  1. User clicks "Place Order." The API gateway calls the Order Service.
  2. The Order Service creates an order in PENDING state and publishes an OrderCreated event.
  3. The Inventory Service consumes the event, reserves stock, and publishes InventoryReserved.
  4. The Payment Service consumes the event, charges the customer, and publishes PaymentCompleted.
  5. The Order Service consumes both events and transitions the order to CONFIRMED.
  6. The Fulfillment Service picks up the confirmed order for shipping.

If any step fails, compensating transactions undo the previous steps (release inventory, refund payment).

Detailed Design

Order State Machine

The order follows a well-defined state machine. Every transition is triggered by an event, and only valid transitions are allowed.

                    ┌──────────────┐
                    │   CREATED    │──── inventory failed ──→ CANCELLED
                    └──────┬───────┘
                           │ inventory reserved
                    ┌──────▼───────┐      payment failed → retry
                    │   RESERVED   │──── (retries exhausted: release
                    └──────┬───────┘      inventory ──→ CANCELLED)
                           │ payment completed
                    ┌──────▼───────┐
                    │  CONFIRMED   │──── cancelled before shipment ──→ CANCELLED
                    └──────┬───────┘
                           │ shipped
                    ┌──────▼───────┐
                    │   SHIPPED    │
                    └──────┬───────┘
                           │ delivered
                    ┌──────▼───────┐
                    │  DELIVERED   │
                    └──────────────┘

Each state transition is persisted as an event in an append-only order events table. This gives a full audit trail and enables event sourcing if needed.

python
from enum import Enum

class OrderState(Enum):
    CREATED = "created"
    RESERVED = "reserved"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"
    REFUND_PENDING = "refund_pending"
    REFUNDED = "refunded"

# CANCELLED and REFUNDED are terminal states: no outgoing transitions
VALID_TRANSITIONS = {
    OrderState.CREATED: [OrderState.RESERVED, OrderState.CANCELLED],
    OrderState.RESERVED: [OrderState.CONFIRMED, OrderState.CANCELLED],
    OrderState.CONFIRMED: [OrderState.SHIPPED, OrderState.CANCELLED],
    OrderState.SHIPPED: [OrderState.DELIVERED],
    OrderState.DELIVERED: [OrderState.REFUND_PENDING],
    OrderState.REFUND_PENDING: [OrderState.REFUNDED],
}

def transition_order(order, new_state, event_data):
    if new_state not in VALID_TRANSITIONS.get(order.state, []):
        raise InvalidTransitionError(
            f"Cannot transition from {order.state} to {new_state}"
        )
    previous_state = order.state  # capture before mutating, so the
    order.state = new_state       # audit record keeps the true from_state
    order.updated_at = now()
    order_events.append(OrderEvent(
        order_id=order.id,
        from_state=previous_state,
        to_state=new_state,
        event_data=event_data,
        timestamp=now()
    ))

Saga Pattern for Distributed Transactions

An order involves multiple services: Inventory, Payment, and Fulfillment. A traditional database transaction cannot span these services. The saga pattern breaks the process into a sequence of local transactions, each with a compensating action that undoes it on failure. For event-driven architecture patterns, see Part 6: Message Queues and Event-Driven Architecture.

Choreography-based saga (event-driven):

Each service listens for events and reacts independently. No central coordinator.

1. Order Service    → publishes OrderCreated
2. Inventory Service → consumes OrderCreated
                     → reserves inventory
                     → publishes InventoryReserved (or InventoryFailed)
3. Payment Service  → consumes InventoryReserved
                     → charges payment
                     → publishes PaymentCompleted (or PaymentFailed)
4. Order Service    → consumes PaymentCompleted
                     → updates order to CONFIRMED
                     → publishes OrderConfirmed
5. Fulfillment      → consumes OrderConfirmed
                     → initiates shipping

Compensating transactions (on failure):

PaymentFailed event:
  → Inventory Service releases reserved stock
  → Order Service marks order as CANCELLED
  → Notification Service sends failure email to user

InventoryFailed event:
  → Order Service marks order as CANCELLED
  → Notification Service sends "out of stock" email
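The listen-and-react behavior above, including the compensating path, can be sketched with a minimal in-memory bus. This is an illustrative sketch only: in production each service runs its own Kafka consumer, and every name here (the bus, `inventory_on_order_created`, the event payload shape) is an assumption, not part of any framework.

```python
# Minimal in-memory event bus to illustrate choreography.
# All names and payload shapes are illustrative assumptions.
handlers = {}
log = []  # events published, captured for demonstration

def subscribe(event_type, handler):
    handlers.setdefault(event_type, []).append(handler)

def publish(event_type, payload):
    log.append(event_type)
    for handler in handlers.get(event_type, []):
        handler(payload)

stock = {"SKU-123": 1}
reserved = {}  # order_id -> (sku, qty) held by pending orders

def inventory_on_order_created(event):
    sku, qty = event["sku"], event["qty"]
    if stock.get(sku, 0) >= qty:
        stock[sku] -= qty
        reserved[event["order_id"]] = (sku, qty)
        publish("InventoryReserved", event)
    else:
        publish("InventoryFailed", event)

def inventory_on_payment_failed(event):
    # Compensating transaction: put the reserved stock back
    sku, qty = reserved.pop(event["order_id"])
    stock[sku] += qty

subscribe("OrderCreated", inventory_on_order_created)
subscribe("PaymentFailed", inventory_on_payment_failed)

publish("OrderCreated", {"order_id": "ord_1", "sku": "SKU-123", "qty": 1})
publish("PaymentFailed", {"order_id": "ord_1"})
# Stock is back to 1: the reservation was compensated, not lost
```

Note that the Inventory Service never calls the Payment Service directly; it only reacts to events, which is exactly what makes choreography hard to trace as the number of event types grows.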

Orchestration-based saga (alternative):

A central Order Saga Orchestrator coordinates the steps explicitly:

python
class OrderSagaOrchestrator:
    def execute(self, order):
        completed_steps = []  # track finished steps so we can compensate
        try:
            # Step 1: Reserve inventory
            inventory_result = inventory_service.reserve(
                order.items, order.id
            )
            if not inventory_result.success:
                self.cancel_order(order, "Inventory unavailable")
                return
            completed_steps.append("reserve_inventory")

            # Step 2: Process payment (idempotency key makes retries safe)
            payment_result = payment_service.charge(
                order.user_id, order.total,
                idempotency_key=f"order-{order.id}"
            )
            if not payment_result.success:
                # Compensate: release inventory
                inventory_service.release(order.items, order.id)
                self.cancel_order(order, "Payment failed")
                return
            completed_steps.append("charge_payment")

            # Step 3: Confirm order
            order.transition(OrderState.CONFIRMED)
            event_bus.publish("OrderConfirmed", order)

        except Exception:
            # Compensate all completed steps in reverse order
            self.compensate(order, completed_steps)

Choreography vs Orchestration:

  • Choreography is simpler for small sagas (3-4 steps) but becomes hard to trace and debug as complexity grows.
  • Orchestration centralizes the flow logic, making it easier to understand, test, and monitor. Preferred for complex multi-step flows.

Inventory Reservation: Pessimistic vs Optimistic Locking

The core inventory challenge is preventing overselling: two users trying to buy the last item must not both succeed. For database-level locking strategies, see Part 7: Sharding and Partitioning.

Pessimistic locking (SELECT FOR UPDATE):

Lock the inventory row when reading it. Other transactions must wait until the lock is released.

sql
BEGIN;
SELECT quantity FROM inventory WHERE product_id = 'SKU123' FOR UPDATE;
-- If quantity >= requested_amount:
UPDATE inventory SET quantity = quantity - 1,
       reserved = reserved + 1
WHERE product_id = 'SKU123';
COMMIT;

Pros: Guarantees no overselling. Cons: Lock contention. During a flash sale, thousands of concurrent requests for the same product will serialize, creating a bottleneck. Throughput drops to the database's lock processing speed.

Optimistic locking (version-based):

Read the inventory row with its version number. On update, check that the version has not changed. If it has, retry.

sql
-- Read
SELECT quantity, version FROM inventory WHERE product_id = 'SKU123';
-- quantity = 10, version = 42
 
-- Update with version check
UPDATE inventory
SET quantity = quantity - 1, reserved = reserved + 1, version = version + 1
WHERE product_id = 'SKU123' AND version = 42;
-- If rows_affected = 0, version changed → retry

Pros: No locks held. Higher throughput under moderate contention. Cons: Under high contention (flash sales), most retries fail, wasting resources. Works well when conflicts are rare.
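The retry loop around the version check might look like the sketch below. It simulates the row as an in-memory dict so the compare-and-set logic is visible; a real implementation would issue the UPDATE statement above and check `rows_affected`. The function and table shapes are assumptions for illustration.

```python
# In-memory stand-in for the inventory row; a real implementation would
# run the versioned UPDATE against the database. Names are illustrative.
inventory = {"SKU123": {"quantity": 10, "reserved": 0, "version": 42}}

def try_reserve(product_id, amount, max_retries=3):
    for _ in range(max_retries):
        row = dict(inventory[product_id])   # read a snapshot (quantity, version)
        if row["quantity"] < amount:
            return False                    # genuinely out of stock
        current = inventory[product_id]
        if current["version"] == row["version"]:
            # Version unchanged since our read: apply the update atomically
            current["quantity"] -= amount
            current["reserved"] += amount
            current["version"] += 1
            return True
        # Version changed underneath us (rows_affected = 0): loop and retry
    return False  # gave up after max_retries conflicts

ok = try_reserve("SKU123", 1)
```

The `max_retries` bound matters: under flash-sale contention an unbounded retry loop would spin, which is exactly why the queue-based approach below is recommended for hot SKUs.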

Recommended approach for flash sales:

Use a single-partition message queue per product. All purchase requests for a given SKU are routed to the same queue partition and processed serially. This eliminates lock contention entirely while maintaining strict ordering.

Purchase requests for SKU-123 → Kafka partition (key=SKU-123) → Single consumer

The consumer processes one request at a time: check stock, decrement, confirm. Since there is only one consumer per partition, there is no concurrent access and no locking needed.
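The single-consumer loop can be sketched with a plain queue standing in for the Kafka partition. The deque, function names, and result tuples are illustrative assumptions; the point is that serial processing makes the check-then-decrement safe without any locks.

```python
from collections import deque

# One queue per SKU stands in for a Kafka partition keyed by product_id.
# Names and shapes here are illustrative, not a specific client API.
stock = {"SKU-123": 2}
partition = deque()  # all requests for SKU-123 arrive here, in order

def enqueue_purchase(order_id, sku, qty):
    partition.append((order_id, sku, qty))

def run_consumer():
    # Single consumer per partition: no concurrent access to `stock`,
    # so check-then-decrement needs no locking at all.
    results = []
    while partition:
        order_id, sku, qty = partition.popleft()
        if stock[sku] >= qty:
            stock[sku] -= qty
            results.append((order_id, "confirmed"))
        else:
            results.append((order_id, "out_of_stock"))
    return results

enqueue_purchase("ord_1", "SKU-123", 1)
enqueue_purchase("ord_2", "SKU-123", 1)
enqueue_purchase("ord_3", "SKU-123", 1)
results = run_consumer()
# Only the first two orders get the two remaining units
```

The trade-off is throughput per SKU: a single consumer caps how fast one product can sell, which is acceptable because the bottleneck is per-product by design.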

Reservation with TTL:

When a user adds an item to their cart, the system does not decrement stock immediately. Instead, it creates a time-limited reservation (e.g., 10 minutes). If the user does not complete checkout within the TTL, the reservation expires and the stock is released back. This prevents cart abandonment from permanently reducing available inventory.

python
def reserve_inventory(product_id: str, quantity: int, order_id: str):
    reservation = {
        "product_id": product_id,
        "quantity": quantity,
        "order_id": order_id,
        "expires_at": now() + timedelta(minutes=10)
    }
    # Atomic: decrement available, increment reserved; the WHERE clause
    # guarantees we never reserve more stock than is available
    result = db.execute("""
        UPDATE inventory
        SET available = available - %s, reserved = reserved + %s
        WHERE product_id = %s AND available >= %s
    """, (quantity, quantity, product_id, quantity))

    if result.rows_affected == 0:
        raise InsufficientStockError()

    # Persist the reservation so it can be looked up and released later
    db.insert("reservations", reservation)

    # Schedule TTL expiration
    scheduler.schedule_at(reservation["expires_at"],
                          release_reservation, order_id)

Idempotency Keys for Payments

Payment processing must be idempotent. If the client retries a payment request (due to a timeout or network error), the payment provider must not charge the customer again. The solution is an idempotency key: a unique identifier sent with each payment request.

python
def process_payment(order_id: str, amount: Decimal, payment_method: str):
    idempotency_key = f"payment-{order_id}"

    # Check if this payment was already processed
    existing = db.query(
        "SELECT * FROM payments WHERE idempotency_key = %s",
        idempotency_key
    )
    if existing:
        return existing  # Return the previous result instead of re-charging

    # Create the payment record before calling the provider; a UNIQUE
    # constraint on idempotency_key rejects concurrent duplicate inserts
    payment = db.insert("payments", {
        "idempotency_key": idempotency_key,
        "order_id": order_id,
        "amount": amount,
        "status": "PENDING",
        "created_at": now()
    })

    try:
        # Pass the same idempotency key to the provider so it can
        # deduplicate on its side as well
        result = payment_provider.charge(
            amount=amount,
            payment_method=payment_method,
            idempotency_key=idempotency_key
        )
        payment.update(status="COMPLETED", provider_id=result.id)
        return payment
    except PaymentDeclinedError:
        payment.update(status="FAILED")
        raise

Two layers of idempotency:

  1. Application layer: The Order Service checks its own database before calling the payment provider.
  2. Provider layer: Stripe, PayPal, and other providers accept an idempotency_key parameter and deduplicate on their end.

Both layers are necessary because network failures can cause the application to miss the provider's response. The provider might have charged successfully, but the application never received the confirmation. On retry, the provider's idempotency check prevents a double charge.

Event-Driven Order Updates

Every state change in the order lifecycle is published as an event to Kafka. Downstream services consume events relevant to them:

json
{
  "event_type": "OrderConfirmed",
  "order_id": "ord_abc123",
  "user_id": "usr_456",
  "items": [
    {"product_id": "SKU-123", "quantity": 2, "price": 29.99}
  ],
  "total": 59.98,
  "timestamp": "2025-08-05T14:23:00Z"
}

Event consumers:

  • Fulfillment Service — Listens for OrderConfirmed to initiate picking and packing.
  • Notification Service — Listens for all order events to send status update emails/push.
  • Analytics Service — Listens for OrderConfirmed and OrderCancelled for revenue tracking.
  • Search/Recommendation Service — Listens for order events to update purchase history.

Event ordering guarantee: Use Kafka with order_id as the partition key. All events for a given order go to the same partition and are processed in order. This ensures a consumer never processes OrderShipped before OrderConfirmed.
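The per-order ordering guarantee follows from how keyed partitioning works: the same key always hashes to the same partition. The sketch below illustrates the property with a stable hash; it is not Kafka's actual partitioner (which uses murmur2), and the partition count is an assumption.

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative; Kafka's default partitioner uses murmur2

def partition_for(key: str) -> int:
    # Any stable hash of the key gives the same property:
    # one key -> one partition, always.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

events = ["OrderCreated", "OrderConfirmed", "OrderShipped"]
partitions = {partition_for("ord_abc123") for _ in events}
# Every event keyed by the same order_id lands on the same partition,
# so a single consumer sees them in publish order.
```

The corollary is that ordering is only guaranteed per key: events for different orders may interleave across partitions, which is fine because consumers only care about ordering within one order.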

Cart Service Design

The cart is a high-read, high-write, ephemeral data structure. It does not need the same durability guarantees as orders.

Storage options:

  • Redis (recommended for logged-in users): Store the cart as a Redis hash. Fast reads and writes. Set a TTL of 30 days for cart expiration. If Redis goes down, the cart is lost, but that is acceptable for most e-commerce sites.
  • Database (for persistence): Store cart data in a database for users who expect their cart to persist across devices and sessions. Use Redis as a write-through cache in front of the database.
  • Client-side (for anonymous users): Store the cart in a cookie or local storage. Merge with the server-side cart upon login.

Redis Hash — cart:{user_id}
Fields:
  SKU-123: {"quantity": 2, "price": 29.99, "added_at": "..."}
  SKU-456: {"quantity": 1, "price": 49.99, "added_at": "..."}
TTL: 30 days
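The hash layout above maps directly onto Redis HSET/HGETALL. The sketch below simulates the same structure with a plain dict so the serialization logic is visible; swapping the dict for a `redis-py` client is mechanical. All function names here are illustrative.

```python
import json

# Stand-in for Redis: key "cart:{user_id}" -> hash of SKU -> item JSON.
# With redis-py this would be HSET / HGETALL plus EXPIRE for the 30-day TTL.
carts = {}

def add_to_cart(user_id, sku, quantity, price):
    cart = carts.setdefault(f"cart:{user_id}", {})
    # Merge with any existing line for the same SKU
    item = json.loads(cart.get(sku, '{"quantity": 0}'))
    item.update(quantity=item["quantity"] + quantity, price=price)
    cart[sku] = json.dumps(item)

def get_cart(user_id):
    return {sku: json.loads(v)
            for sku, v in carts.get(f"cart:{user_id}", {}).items()}

add_to_cart("usr_456", "SKU-123", 2, 29.99)
add_to_cart("usr_456", "SKU-123", 1, 29.99)
```

Storing each line item as serialized JSON keeps a single round trip per field, and per-SKU fields mean two devices adding different items do not clobber each other's writes.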

Cart-to-order conversion:

When the user clicks "Place Order," the Cart Service reads the cart, validates prices against the current catalog (prices may have changed since the item was added), validates inventory availability, and passes the validated cart to the Order Service. The Order Service creates the order and deletes the cart.

Price consistency: The price shown in the cart may differ from the current price at checkout time. The system should re-validate prices at order placement and display any changes to the user before confirming.
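The re-validation step can be sketched as a pure function over the cart and the live catalog. The shapes of `cart_items` and `catalog` are assumptions for illustration; the key design point is returning the list of changes so the UI can ask for confirmation rather than silently repricing.

```python
def validate_cart_prices(cart_items, catalog):
    # Compare each cart line against the live catalog price at checkout.
    # Returns the changes for display; also updates the cart to the new
    # prices so the order is created with current values.
    changes = []
    for sku, item in cart_items.items():
        current_price = catalog[sku]
        if current_price != item["price"]:
            changes.append({"sku": sku,
                            "was": item["price"],
                            "now": current_price})
            item["price"] = current_price
    return changes

cart = {"SKU-123": {"quantity": 2, "price": 29.99}}
catalog = {"SKU-123": 24.99}
changes = validate_cart_prices(cart, catalog)
```

An empty return value means prices are unchanged and checkout can proceed without interrupting the user.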

Data Model

Orders Table

Column | Type | Description
order_id | UUID | Primary key
user_id | UUID | Customer who placed the order
status | VARCHAR | created/reserved/confirmed/shipped/delivered/cancelled
total_amount | DECIMAL | Order total
currency | VARCHAR(3) | Currency code (USD, EUR)
shipping_address | JSONB | Delivery address
payment_method | VARCHAR | Payment method identifier
created_at | TIMESTAMP | Order creation time
updated_at | TIMESTAMP | Last status change

Order Items Table

Column | Type | Description
order_id | UUID | Foreign key to orders
product_id | VARCHAR | SKU identifier
quantity | INT | Number of units
unit_price | DECIMAL | Price at time of purchase
subtotal | DECIMAL | quantity * unit_price

Order Events Table (Append-Only Audit Log)

Column | Type | Description
event_id | UUID | Primary key
order_id | UUID | Foreign key to orders
event_type | VARCHAR | OrderCreated, PaymentCompleted, etc.
from_state | VARCHAR | Previous state
to_state | VARCHAR | New state
event_data | JSONB | Additional event metadata
created_at | TIMESTAMP | Event timestamp

Inventory Table

Column | Type | Description
product_id | VARCHAR | Primary key (SKU)
available | INT | Available for purchase
reserved | INT | Reserved by pending orders
warehouse_id | VARCHAR | Physical location
version | INT | Optimistic lock version
updated_at | TIMESTAMP | Last update time

Payments Table

Column | Type | Description
payment_id | UUID | Primary key
order_id | UUID | Foreign key to orders
idempotency_key | VARCHAR | Unique key for deduplication
amount | DECIMAL | Charged amount
currency | VARCHAR(3) | Currency code
status | VARCHAR | pending/completed/failed/refunded
provider_id | VARCHAR | External provider transaction ID
created_at | TIMESTAMP | Payment initiation time
completed_at | TIMESTAMP | Payment completion time

Scaling Considerations

Order Service scaling: The Order Service is stateless and scales horizontally. Orders are partitioned by order_id in the database. For sharding strategies, see Part 7: Sharding and Partitioning.

Inventory Service — the hot spot problem: During flash sales, a single popular product receives thousands of concurrent purchase requests. Solutions:

  1. Queue-based serialization: Route all requests for a product to a single Kafka partition. Process them sequentially.
  2. Inventory sharding: Split inventory for a hot product across multiple "virtual inventory pools." Each pool handles a fraction of the total stock. Requests are distributed across pools.
  3. Pre-deduct with reconciliation: Deduct inventory optimistically and reconcile asynchronously. Risk of slight overselling, handled by backorders.

Payment Service: Payment provider APIs are the bottleneck (typically 1-3 seconds per call). Use connection pooling, circuit breakers (to avoid hammering a failing provider), and multiple payment providers for failover.
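A circuit breaker around provider calls can be sketched as below. This is a minimal illustrative implementation, not a specific library's API; production systems typically also distinguish an explicit half-open state and track failure rates rather than consecutive counts.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after `threshold` consecutive
    failures the circuit opens and calls fail fast for `cooldown` seconds,
    sparing a struggling payment provider from retry storms."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow a trial call

        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping `payment_provider.charge` in `breaker.call(...)` means that once the provider is clearly down, checkout requests fail in microseconds instead of waiting out 1-3 second timeouts, and a fallback provider can be tried immediately.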

Event bus (Kafka) partitioning: Partition the order events topic by order_id to guarantee per-order event ordering. The inventory events topic should be partitioned by product_id for per-product ordering.

Database choices: Use PostgreSQL for orders and payments (ACID transactions needed). Use a separate database for inventory if write throughput demands it. Use Redis for cart storage.

Trade-offs and Alternatives

Decision | Option A | Option B | Recommendation
Saga pattern | Choreography | Orchestration | Orchestration for complex flows (5+ steps)
Inventory locking | Pessimistic | Optimistic | Optimistic for normal traffic, queue-based for flash sales
Cart storage | Redis | Database | Redis with database fallback for persistence
Payment retry | Immediate | Exponential backoff | Backoff with idempotency keys
Event delivery | At-least-once | Exactly-once | At-least-once with idempotent consumers

Why not a monolithic transaction? A single distributed transaction (two-phase commit) across Inventory, Payment, and Fulfillment services would lock resources across all three databases simultaneously. This does not scale: a slow payment provider would hold locks on inventory rows, blocking other users from purchasing. The saga pattern releases locks immediately after each local transaction, allowing services to scale independently.

Why not event sourcing for everything? Event sourcing (rebuilding state by replaying events) is powerful for audit trails and debugging, but it adds complexity for simple CRUD operations like cart management. Use event sourcing for the order lifecycle (where the audit trail is valuable) and traditional CRUD for the cart and user preferences.

Handling partial failures: If payment succeeds but the inventory service crashes before releasing the reservation, the system has an inconsistency. The saga orchestrator must track which steps completed and run compensating transactions. A periodic reconciliation job compares order states across services and flags inconsistencies for manual or automated resolution.
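The reconciliation job described above can be sketched as a comparison across per-service views of order state. The input shapes and rule names are illustrative assumptions; a real job would page through recent orders and emit flagged cases to an alerting or auto-repair pipeline.

```python
def reconcile(orders, payments, reservations):
    # Cross-check per-service views of each order and flag disagreements,
    # e.g. a completed payment with no confirmed order, or a cancelled
    # order still holding reserved stock. Shapes are illustrative.
    SETTLED = ("CONFIRMED", "SHIPPED", "DELIVERED")
    anomalies = []
    for order_id, state in orders.items():
        if payments.get(order_id) == "COMPLETED" and state not in SETTLED:
            anomalies.append((order_id, "paid_but_not_confirmed"))
        if state == "CANCELLED" and reservations.get(order_id):
            anomalies.append((order_id, "cancelled_but_stock_reserved"))
    return anomalies

anomalies = reconcile(
    orders={"ord_1": "CANCELLED", "ord_2": "RESERVED"},
    payments={"ord_2": "COMPLETED"},
    reservations={"ord_1": ("SKU-123", 1)},
)
```

Each anomaly type maps to a compensating action (refund, release, or state repair), so the job closes the consistency gaps that the saga could not finish on its own.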

FAQ

Why use the saga pattern instead of distributed transactions in e-commerce?

Distributed transactions (2PC) lock resources and do not scale. The saga pattern breaks the order flow into local transactions with compensating actions, allowing each service to scale independently while maintaining eventual consistency. In a 2PC approach, the payment provider, inventory database, and order database all hold locks simultaneously. If the payment provider takes 3 seconds to respond, the inventory row is locked for 3 seconds, blocking other purchases of that product. With sagas, the inventory is reserved (local transaction completes and releases the lock immediately), then payment is processed separately. If payment fails, a compensating transaction releases the inventory reservation.

How do you prevent overselling inventory in a high-traffic system?

Use optimistic locking with version checks, reserve inventory with TTL-based holds during checkout, and process orders through a single-partition queue per product to serialize inventory updates without distributed locks. The TTL-based reservation is critical: when a user begins checkout, the system temporarily reserves their items for 10 minutes. If they do not complete the purchase, the reservation expires automatically and the stock becomes available again. During flash sales, the queue-based approach is most effective because it eliminates contention entirely by processing one purchase at a time per product.

How should the system handle payment failures during order processing?

Implement compensating transactions that release reserved inventory and notify the user. Use idempotency keys for payment retries, store payment state in a state machine, and support multiple payment method fallbacks. The idempotency key is the most critical piece: it ensures that if the client retries a failed payment request (because the response was lost due to a network error), the payment provider recognizes the duplicate and returns the original result instead of charging again. The order stays in RESERVED state during payment retry, and the inventory reservation TTL is extended to prevent the reserved items from being released while the payment is being retried.


Article Author: Sadam Hussain, Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
