Message Queues and Event-Driven Systems

August 22, 2025 · 8 min read

Learn how message queues like Kafka, RabbitMQ, and SQS enable event-driven architectures. Understand pub/sub patterns, guaranteed delivery, and async processing.

Tags: System Design, Message Queues, Kafka, Event-Driven

This is Part 6 of the System Design from Zero to Hero series.

TL;DR

Message queues decouple producers from consumers, enabling asynchronous processing, better fault tolerance, and the ability to handle traffic spikes without dropping requests. Understanding when to use Kafka vs RabbitMQ vs SQS, and how to handle delivery guarantees, is essential for designing resilient distributed systems.

Why This Matters

In Part 1, we discussed the building blocks of distributed systems. In Part 2, we covered scaling strategies. But there is a fundamental problem that scaling alone does not solve: what happens when Service A needs to talk to Service B, but Service B is slow, overloaded, or temporarily down?

Direct synchronous API calls create tight coupling. If a payment service takes 3 seconds to process a transaction, every upstream service calling it synchronously is blocked for those 3 seconds. If the payment service goes down, every caller fails. Message queues break this dependency chain.

In production systems at scale, asynchronous communication through message queues is not optional -- it is the backbone of reliability. Order processing, notification delivery, log aggregation, data pipeline ingestion -- all of these rely on message queues to function under real-world conditions.

Core Concepts

Pub/Sub vs Point-to-Point

There are two fundamental messaging patterns:

Point-to-Point (Queue): A message is consumed by exactly one consumer. Think of a work queue where tasks are distributed across workers. RabbitMQ queues and Amazon SQS operate this way by default.

Publish/Subscribe (Pub/Sub): A message is broadcast to all subscribers. Think of a notification system where multiple services need to react to the same event. Kafka topics, RabbitMQ fanout exchanges, and Amazon SNS operate this way.

The choice between these patterns shapes your entire architecture. Point-to-point is for distributing work. Pub/sub is for broadcasting events.
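
The contrast can be sketched without any broker at all — a toy in-memory version in plain Python (the task and subscriber names are illustrative, not from any real system):

```python
from queue import Queue

# Point-to-point: each message is consumed by exactly one worker
work_queue = Queue()
for task in ["resize-img-1", "resize-img-2", "resize-img-3"]:
    work_queue.put(task)

worker_a = [work_queue.get() for _ in range(2)]
worker_b = [work_queue.get()]
# Every task was delivered exactly once, split across workers
assert worker_a + worker_b == ["resize-img-1", "resize-img-2", "resize-img-3"]

# Pub/sub: every subscriber receives every event
subscribers = {"billing": [], "email": [], "analytics": []}

def publish(event):
    for inbox in subscribers.values():
        inbox.append(event)   # broadcast to all subscribers

publish("order_created")
assert all(inbox == ["order_created"] for inbox in subscribers.values())
```

Real brokers add durability, acknowledgements, and delivery guarantees on top, but the routing semantics are exactly these two shapes.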

Apache Kafka Architecture

Kafka is not a traditional message queue -- it is a distributed commit log. This distinction matters because it fundamentally changes how you think about message consumption.

Brokers: Kafka runs as a cluster of broker nodes. Each broker stores a subset of the data. In production, you typically run at least 3 brokers for fault tolerance.

Topics and Partitions: A topic is a logical stream of events. Each topic is split into partitions, which are the unit of parallelism. Messages within a partition are strictly ordered. If you need global ordering, you use a single partition (at the cost of throughput).

Consumer Groups: Multiple consumers form a consumer group. Kafka assigns each partition to exactly one consumer within a group. This means you can scale consumers up to the number of partitions -- adding more consumers than partitions leaves them idle.

```python
# Kafka producer example using confluent-kafka
from confluent_kafka import Producer

conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',
    'acks': 'all',                # Wait for all replicas to acknowledge
    'enable.idempotence': True,   # Prevent duplicate messages on retries
    'retries': 5,
    'retry.backoff.ms': 100,
}

producer = Producer(conf)

def delivery_callback(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")

# Partition key ensures all events for the same order go to the same partition
producer.produce(
    topic='order-events',
    key=b'order-12345',
    value=b'{"event": "order_created", "amount": 99.99}',
    callback=delivery_callback
)
producer.flush()
```

```python
# Kafka consumer example
from confluent_kafka import Consumer

conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',
    'group.id': 'order-processing-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,   # Manual commit for at-least-once delivery
}

consumer = Consumer(conf)
consumer.subscribe(['order-events'])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue

    # Process the message (application-defined handler)
    process_order_event(msg.value())

    # Commit offset only after successful processing
    consumer.commit(message=msg)
```

Replication: Each partition has a configurable replication factor. One replica is the leader (handles all reads and writes), and the others are followers. If the leader fails, a follower is elected as the new leader. The acks=all setting ensures the producer waits for all in-sync replicas to acknowledge the write.

RabbitMQ Architecture

RabbitMQ follows the AMQP protocol with a different model: exchanges, queues, and bindings.

Exchanges receive messages from producers and route them to queues based on rules:

  • Direct exchange: Routes to queues matching the exact routing key
  • Fanout exchange: Broadcasts to all bound queues (classic pub/sub)
  • Topic exchange: Routes based on pattern matching (e.g., order.*.created)
  • Headers exchange: Routes based on message header attributes

Queues store messages until consumers acknowledge them. Unlike Kafka, once a message is acknowledged, it is deleted from the queue.

Bindings connect exchanges to queues with optional routing keys.

```python
# RabbitMQ producer with exchange routing
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters('rabbitmq-host')
)
channel = connection.channel()

# Declare a topic exchange
channel.exchange_declare(exchange='order_events', exchange_type='topic')

# Publish with routing key for topic-based routing
channel.basic_publish(
    exchange='order_events',
    routing_key='order.payment.completed',
    body='{"order_id": "12345", "amount": 99.99}',
    properties=pika.BasicProperties(
        delivery_mode=2,  # Persistent message
        content_type='application/json',
    )
)
connection.close()
```

Amazon SQS

SQS is AWS's fully managed queue service. It trades flexibility for operational simplicity -- no brokers to manage, no partitions to configure. There are two flavors:

  • Standard Queues: At-least-once delivery, best-effort ordering, nearly unlimited throughput
  • FIFO Queues: Exactly-once processing, strict ordering, limited to 3,000 messages/second with batching

SQS is the right choice when you want a simple work queue without managing infrastructure.

Event Sourcing and CQRS

Event Sourcing stores every state change as an immutable event rather than overwriting the current state. Instead of a users table with the latest data, you have an event stream: UserCreated, EmailChanged, AccountDeactivated. The current state is derived by replaying events.

This pairs naturally with message queues -- Kafka's append-only log is essentially an event store.
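
Deriving state from the event stream can be sketched as a fold over the events — a minimal illustration using the event names from the example above (the handler logic is illustrative):

```python
# An immutable event stream for one user, oldest first
events = [
    {"type": "UserCreated", "email": "a@example.com"},
    {"type": "EmailChanged", "email": "b@example.com"},
    {"type": "AccountDeactivated"},
]

def apply(state, event):
    # Each event type maps the previous state to the next state
    if event["type"] == "UserCreated":
        return {"email": event["email"], "active": True}
    if event["type"] == "EmailChanged":
        return {**state, "email": event["email"]}
    if event["type"] == "AccountDeactivated":
        return {**state, "active": False}
    return state   # unknown events are ignored

state = {}
for event in events:
    state = apply(state, event)

print(state)  # {'email': 'b@example.com', 'active': False}
```

Because the events are never overwritten, you can rebuild the state at any point in history by replaying a prefix of the stream — the property that makes Kafka's retained log such a natural fit.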

CQRS (Command Query Responsibility Segregation) separates the write model from the read model. Commands modify state by appending events. Queries read from materialized views optimized for specific access patterns.

Command (Write) --> Event Store (Kafka) --> Event Processor --> Read Database (optimized views)
                                       --> Analytics Service
                                       --> Notification Service

This architecture, combined with the caching strategies from Part 5, enables systems to handle massive read loads while maintaining a clean write path.

Dead Letter Queues

When a message fails processing repeatedly, you do not want it blocking the queue forever. Dead letter queues (DLQs) capture failed messages after a configurable number of retry attempts.

```python
# SQS dead letter queue configuration (AWS CDK)
from aws_cdk import Duration, aws_sqs as sqs
from constructs import Construct

class OrderQueueStack(Construct):
    def __init__(self, scope, id):
        super().__init__(scope, id)

        dlq = sqs.Queue(self, "OrderDLQ",
            retention_period=Duration.days(14)  # Keep failed messages 14 days
        )

        main_queue = sqs.Queue(self, "OrderQueue",
            dead_letter_queue=sqs.DeadLetterQueue(
                max_receive_count=3,  # Move to DLQ after 3 failed attempts
                queue=dlq
            )
        )
```

Monitor your DLQ size as a critical metric. A growing DLQ indicates a systemic problem -- not just transient failures.

Back-Pressure Handling

When consumers cannot keep up with producers, you need a back-pressure strategy:

  1. Buffering: Let the queue absorb the spike (Kafka excels here with disk-based storage)
  2. Dropping: Discard messages that exceed a threshold (acceptable for metrics, not for orders)
  3. Throttling: Slow down producers when queue depth exceeds a threshold
  4. Scaling: Auto-scale consumers based on queue depth or consumer lag
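
Option 3 can be sketched as a producer that blocks while queue depth exceeds a threshold — a toy in-memory version (a real system would read broker metrics such as consumer lag, and draining would happen in a separate consumer process):

```python
import time
from collections import deque

MAX_DEPTH = 100   # depth threshold before back-pressure kicks in
queue = deque()

def drain_one():
    # Stand-in for a consumer removing one message
    if queue:
        queue.popleft()

def produce(message, backoff_s=0.001):
    # Throttle: block while the queue is at or over the threshold
    while len(queue) >= MAX_DEPTH:
        time.sleep(backoff_s)   # slow the producer down
        drain_one()             # in reality the consumer drains concurrently
    queue.append(message)

for i in range(250):
    produce(f"event-{i}")

# Depth never exceeded the threshold despite 250 sends
assert len(queue) == MAX_DEPTH
assert queue[0] == "event-150" and queue[-1] == "event-249"
```

The same shape applies whether the signal is local queue depth, Kafka consumer lag, or SQS `ApproximateNumberOfMessages` — the producer observes a depth metric and slows down instead of overwhelming the consumer.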

Delivery Guarantees

  • At-most-once: Fire and forget. Message may be lost. Fastest but least reliable.
  • At-least-once: Retry until acknowledged. Message may be delivered multiple times. Requires idempotent consumers.
  • Exactly-once: Each message processed exactly once. Extremely difficult in distributed systems. Kafka achieves this within its ecosystem using idempotent producers and transactional consumers, but true end-to-end exactly-once across system boundaries requires idempotent downstream operations.

The practical approach: design for at-least-once delivery with idempotent consumers. Use a deduplication key (like an order ID) and check before processing.

Practical Implementation

Here is a pattern for idempotent message processing:

```python
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)
DEDUP_TTL = 86400  # 24 hours

def process_message_idempotently(message):
    # Generate a unique key for deduplication
    dedup_key = f"processed:{message['event_id']}"

    # Check if already processed (atomic set-if-not-exists)
    if not redis_client.set(dedup_key, "1", nx=True, ex=DEDUP_TTL):
        print(f"Duplicate message {message['event_id']}, skipping")
        return

    try:
        # Process the message
        handle_order_event(message)
    except Exception:
        # Remove the dedup key so the message can be retried
        redis_client.delete(dedup_key)
        raise
```

This uses Redis (as we discussed in Part 5: Caching) as a fast deduplication store.

Trade-offs and Decision Framework

| Factor | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Throughput | Very high (millions/sec) | Moderate (tens of thousands/sec) | High (managed) |
| Ordering | Per-partition | Per-queue | FIFO queues only |
| Message Replay | Yes (retained on disk) | No (deleted after ack) | No |
| Routing Flexibility | Limited (partition key) | Rich (exchanges, bindings) | Limited |
| Operational Overhead | High (ZooKeeper/KRaft) | Medium | None (managed) |
| Use Case | Event streaming, logs, CDC | Task queues, complex routing | Simple cloud-native queues |

Choose Kafka when you need event replay, high throughput, or stream processing. Choose RabbitMQ when you need complex routing, priority queues, or request/reply patterns. Choose SQS when you want a managed solution and do not need message replay or complex routing.

Common Interview Questions

Q: How would you design a system to handle 100,000 order events per second?
A: Use Kafka with enough partitions (at least 100) to distribute the load. Partition by order ID for ordering guarantees per order. Use consumer groups with auto-scaling based on consumer lag. Ensure idempotent processing with a deduplication store.

Q: A consumer keeps failing on a specific message. How do you handle it?
A: Configure a dead letter queue with a max retry count (e.g., 3 attempts). Failed messages move to the DLQ for investigation. Set up alerts on DLQ depth. Implement a separate DLQ processor that can retry with manual intervention or route to an error-handling workflow.

Q: How do you ensure ordering in a distributed message queue?
A: In Kafka, ordering is guaranteed within a partition. Use a consistent partition key (like user ID or order ID) to ensure related messages go to the same partition. If you need global ordering, use a single partition but accept the throughput limitation.
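
The partition-key idea can be sketched with any stable hash — Kafka's default partitioner actually uses murmur2, but MD5 illustrates the same property (the partition count and key names here are illustrative):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    # Deterministic hash: the same key always maps to the same partition
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for one order land on one partition, preserving their order
assert partition_for("order-12345") == partition_for("order-12345")

# Different keys spread load across partitions
partitions = {partition_for(f"order-{i}") for i in range(1000)}
assert len(partitions) == NUM_PARTITIONS  # with many keys, all partitions see traffic
```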

Q: What happens to in-flight messages if a Kafka consumer crashes?
A: If using manual offset commits, uncommitted messages will be redelivered to another consumer in the group after the session timeout. This is why at-least-once delivery requires idempotent processing -- the same message may be processed twice during consumer rebalancing.

What's Next

Now that we understand asynchronous communication between services, Part 7: Database Sharding and Partitioning tackles how to split your data layer across multiple machines when a single database can no longer handle the load.

FAQ

When should I use a message queue instead of direct API calls?

Use message queues when the downstream service is slow, unreliable, or when you need to process tasks asynchronously. Examples include sending emails, processing payments, and generating reports.

What is the difference between Kafka and RabbitMQ?

Kafka is a distributed log optimized for high-throughput event streaming and replay. RabbitMQ is a traditional message broker with flexible routing, priority queues, and better support for complex messaging patterns.

How do I guarantee exactly-once message processing?

True exactly-once is nearly impossible in distributed systems. Instead, design for at-least-once delivery with idempotent consumers that safely handle duplicate messages using unique identifiers or deduplication windows.

Article Author: Sadam Hussain, Senior Full Stack Developer

Related Articles

  • Design an E-Commerce Order Processing System
  • Monitoring, Observability, and Site Reliability
  • CAP Theorem and Distributed Consensus