Message Queues and Event-Driven Systems
Learn how message queues like Kafka, RabbitMQ, and SQS enable event-driven architectures. Understand pub/sub patterns, guaranteed delivery, and async processing.
This is Part 6 of the System Design from Zero to Hero series.
TL;DR
Message queues decouple producers from consumers, enabling asynchronous processing, better fault tolerance, and the ability to handle traffic spikes without dropping requests. Understanding when to use Kafka vs RabbitMQ vs SQS, and how to handle delivery guarantees, is essential for designing resilient distributed systems.
Why This Matters
In Part 1, we discussed the building blocks of distributed systems. In Part 2, we covered scaling strategies. But there is a fundamental problem that scaling alone does not solve: what happens when Service A needs to talk to Service B, but Service B is slow, overloaded, or temporarily down?
Direct synchronous API calls create tight coupling. If a payment service takes 3 seconds to process a transaction, every upstream service calling it synchronously is blocked for those 3 seconds. If the payment service goes down, every caller fails. Message queues break this dependency chain.
In production systems at scale, asynchronous communication through message queues is not optional -- it is the backbone of reliability. Order processing, notification delivery, log aggregation, data pipeline ingestion -- all of these rely on message queues to function under real-world conditions.
Core Concepts
Pub/Sub vs Point-to-Point
There are two fundamental messaging patterns:
Point-to-Point (Queue): A message is consumed by exactly one consumer. Think of a work queue where tasks are distributed across workers. RabbitMQ queues and Amazon SQS operate this way by default.
Publish/Subscribe (Pub/Sub): A message is broadcast to all subscribers. Think of a notification system where multiple services need to react to the same event. Kafka topics, RabbitMQ fanout exchanges, and Amazon SNS operate this way.
The choice between these patterns shapes your entire architecture. Point-to-point is for distributing work. Pub/sub is for broadcasting events.
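The difference can be sketched in a few lines of plain Python (an in-memory illustration, not a real broker; the worker and subscriber names are made up):

```python
import itertools

workers = ["worker-a", "worker-b"]                  # competing consumers
subscribers = ["billing", "shipping", "analytics"]  # independent services

def point_to_point(messages):
    # Each message is consumed by exactly one worker (round-robin here).
    rr = itertools.cycle(workers)
    return {m: next(rr) for m in messages}

def pub_sub(message):
    # Every subscriber receives its own copy of the message.
    return {s: message for s in subscribers}
```

With point-to-point, adding workers increases throughput; with pub/sub, adding subscribers adds new reactions to the same event.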
Apache Kafka Architecture
Kafka is not a traditional message queue -- it is a distributed commit log. This distinction matters because it fundamentally changes how you think about message consumption.
Brokers: Kafka runs as a cluster of broker nodes. Each broker stores a subset of the data. In production, you typically run at least 3 brokers for fault tolerance.
Topics and Partitions: A topic is a logical stream of events. Each topic is split into partitions, which are the unit of parallelism. Messages within a partition are strictly ordered. If you need global ordering, you use a single partition (at the cost of throughput).
Consumer Groups: Multiple consumers form a consumer group. Kafka assigns each partition to exactly one consumer within a group. This means you can scale consumers up to the number of partitions -- adding more consumers than partitions leaves them idle.
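Key-based partition assignment can be sketched as follows (a simplified stand-in using MD5; Kafka's default partitioner actually uses murmur2, but the property is the same: equal keys always map to the same partition):

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Hash the message key and take it modulo the partition count,
    # so all messages with the same key land in the same partition.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

This is also why a hot key (one order ID producing most of the traffic) creates a hot partition no matter how many partitions you add.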
# Kafka producer example using confluent-kafka
from confluent_kafka import Producer

conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',
    'acks': 'all',               # Wait for all replicas to acknowledge
    'enable.idempotence': True,  # Prevent duplicate messages on retries
    'retries': 5,
    'retry.backoff.ms': 100,
}

producer = Producer(conf)

def delivery_callback(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")

# Partition key ensures all events for the same order go to the same partition
producer.produce(
    topic='order-events',
    key=b'order-12345',
    value=b'{"event": "order_created", "amount": 99.99}',
    callback=delivery_callback
)
producer.flush()

# Kafka consumer example
from confluent_kafka import Consumer

conf = {
    'bootstrap.servers': 'broker1:9092,broker2:9092,broker3:9092',
    'group.id': 'order-processing-group',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # Manual commit for at-least-once delivery
}

consumer = Consumer(conf)
consumer.subscribe(['order-events'])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        print(f"Consumer error: {msg.error()}")
        continue
    # Process the message
    process_order_event(msg.value())
    # Commit offset only after successful processing
    consumer.commit(msg)

Replication: Each partition has a configurable replication factor. One replica is the leader (handles all reads and writes), and the others are followers. If the leader fails, a follower is elected as the new leader. The acks=all setting ensures the producer waits for all in-sync replicas to acknowledge the write.
RabbitMQ Architecture
RabbitMQ follows the AMQP protocol with a different model: exchanges, queues, and bindings.
Exchanges receive messages from producers and route them to queues based on rules:
- Direct exchange: Routes to queues matching the exact routing key
- Fanout exchange: Broadcasts to all bound queues (classic pub/sub)
- Topic exchange: Routes based on pattern matching (e.g., order.*.created)
- Headers exchange: Routes based on message header attributes
Queues store messages until consumers acknowledge them. Unlike Kafka, once a message is acknowledged, it is deleted from the queue.
Bindings connect exchanges to queues with optional routing keys.
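Topic-exchange routing can be illustrated with a small matcher (my own re-implementation of the AMQP wildcard rules for illustration; the real matching happens inside the broker). The rules: `*` matches exactly one dot-delimited word, `#` matches zero or more words.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    # Word-by-word wildcard matching: '*' = exactly one word, '#' = zero or more.
    pw, kw = pattern.split('.'), routing_key.split('.')
    # dp[i][j] is True when the first i pattern words match the first j key words
    dp = [[False] * (len(kw) + 1) for _ in range(len(pw) + 1)]
    dp[0][0] = True
    for i in range(1, len(pw) + 1):
        for j in range(len(kw) + 1):
            if pw[i - 1] == '#':
                # '#' consumes nothing (dp[i-1][j]) or one more word (dp[i][j-1])
                dp[i][j] = dp[i - 1][j] or (j > 0 and dp[i][j - 1])
            elif j > 0 and (pw[i - 1] == '*' or pw[i - 1] == kw[j - 1]):
                dp[i][j] = dp[i - 1][j - 1]
    return dp[len(pw)][len(kw)]
```

So `order.*.created` matches `order.payment.created` but not `order.created`, while `order.#` matches both `order` and `order.payment.completed`.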
# RabbitMQ producer with exchange routing
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters('rabbitmq-host')
)
channel = connection.channel()

# Declare a topic exchange
channel.exchange_declare(exchange='order_events', exchange_type='topic')

# Publish with routing key for topic-based routing
channel.basic_publish(
    exchange='order_events',
    routing_key='order.payment.completed',
    body='{"order_id": "12345", "amount": 99.99}',
    properties=pika.BasicProperties(
        delivery_mode=2,  # Persistent message
        content_type='application/json',
    )
)
connection.close()

Amazon SQS
SQS is AWS's fully managed queue service. It trades flexibility for operational simplicity -- no brokers to manage, no partitions to configure. There are two flavors:
- Standard Queues: At-least-once delivery, best-effort ordering, nearly unlimited throughput
- FIFO Queues: Exactly-once processing, strict ordering, limited to 3,000 messages/second with batching
SQS is the right choice when you want a simple work queue without managing infrastructure.
Event Sourcing and CQRS
Event Sourcing stores every state change as an immutable event rather than overwriting the current state. Instead of a users table with the latest data, you have an event stream: UserCreated, EmailChanged, AccountDeactivated. The current state is derived by replaying events.
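A minimal sketch of deriving state by replay (the event names follow the example above; the field layout is made up for illustration):

```python
def apply_event(state, event):
    # Each event type is a pure state transition; replaying the full
    # stream from the beginning reconstructs the current state.
    etype = event["type"]
    if etype == "UserCreated":
        return {"user_id": event["user_id"], "email": event["email"], "active": True}
    if etype == "EmailChanged":
        return {**state, "email": event["email"]}
    if etype == "AccountDeactivated":
        return {**state, "active": False}
    return state  # Unknown events are ignored, keeping replay forward-compatible

def replay(events):
    state = None
    for event in events:
        state = apply_event(state, event)
    return state
```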
This pairs naturally with message queues -- Kafka's append-only log is essentially an event store.
CQRS (Command Query Responsibility Segregation) separates the write model from the read model. Commands modify state by appending events. Queries read from materialized views optimized for specific access patterns.
Command (Write) --> Event Store (Kafka) --> Event Processor --> Read Database (optimized views)
                                                            --> Analytics Service
                                                            --> Notification Service
This architecture, combined with the caching strategies from Part 5, enables systems to handle massive read loads while maintaining a clean write path.
Dead Letter Queues
When a message fails processing repeatedly, you do not want it blocking the queue forever. Dead letter queues (DLQs) capture failed messages after a configurable number of retry attempts.
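The hand-off can be simulated in a few lines (an in-memory sketch; real brokers such as SQS track the receive count for you):

```python
MAX_RECEIVES = 3  # After this many failed attempts, park the message

def deliver(message, processor, main_queue, dlq):
    # Count this delivery attempt, then try to process the message.
    message["receives"] = message.get("receives", 0) + 1
    try:
        processor(message)
    except Exception:
        if message["receives"] >= MAX_RECEIVES:
            dlq.append(message)         # Give up: park it for investigation
        else:
            main_queue.append(message)  # Redeliver for another attempt
```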
# SQS dead letter queue configuration (AWS CDK)
from aws_cdk import Duration, aws_sqs as sqs
from constructs import Construct

class OrderQueueStack(Construct):
    def __init__(self, scope, id):
        super().__init__(scope, id)
        dlq = sqs.Queue(self, "OrderDLQ",
            retention_period=Duration.days(14)
        )
        main_queue = sqs.Queue(self, "OrderQueue",
            dead_letter_queue=sqs.DeadLetterQueue(
                max_receive_count=3,  # Move to DLQ after 3 failed attempts
                queue=dlq
            )
        )

Monitor your DLQ size as a critical metric. A growing DLQ indicates a systemic problem -- not just transient failures.
Back-Pressure Handling
When consumers cannot keep up with producers, you need a back-pressure strategy:
- Buffering: Let the queue absorb the spike (Kafka excels here with disk-based storage)
- Dropping: Discard messages that exceed a threshold (acceptable for metrics, not for orders)
- Throttling: Slow down producers when queue depth exceeds a threshold
- Scaling: Auto-scale consumers based on queue depth or consumer lag
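The throttling strategy can be sketched with a depth check before each publish (an in-memory illustration; in practice the depth would come from a broker metric such as Kafka consumer lag or SQS ApproximateNumberOfMessages):

```python
import time
from collections import deque

queue = deque()
MAX_DEPTH = 100  # Illustrative threshold; tune against real consumer throughput

def publish_with_throttling(message, max_depth=MAX_DEPTH, pause_s=0.01):
    # Block the producer until consumers drain the queue below the threshold.
    while len(queue) >= max_depth:
        time.sleep(pause_s)
    queue.append(message)
```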
Delivery Guarantees
- At-most-once: Fire and forget. Message may be lost. Fastest but least reliable.
- At-least-once: Retry until acknowledged. Message may be delivered multiple times. Requires idempotent consumers.
- Exactly-once: Each message processed exactly once. Extremely difficult in distributed systems. Kafka achieves this within its ecosystem using idempotent producers and transactional consumers, but true end-to-end exactly-once across system boundaries requires idempotent downstream operations.
The practical approach: design for at-least-once delivery with idempotent consumers. Use a deduplication key (like an order ID) and check before processing.
Practical Implementation
Here is a pattern for idempotent message processing:
import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)
DEDUP_TTL = 86400  # 24 hours

def process_message_idempotently(message):
    # Generate a unique key for deduplication
    dedup_key = f"processed:{message['event_id']}"
    # Check if already processed (atomic set-if-not-exists)
    if not redis_client.set(dedup_key, "1", nx=True, ex=DEDUP_TTL):
        print(f"Duplicate message {message['event_id']}, skipping")
        return
    try:
        # Process the message
        handle_order_event(message)
    except Exception:
        # Remove dedup key so message can be retried
        redis_client.delete(dedup_key)
        raise

This uses Redis (as we discussed in Part 5: Caching) as a fast deduplication store.
Trade-offs and Decision Framework
| Factor | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Throughput | Very high (millions/sec) | Moderate (tens of thousands/sec) | High (managed) |
| Ordering | Per-partition | Per-queue | FIFO queues only |
| Message Replay | Yes (retained on disk) | No (deleted after ack) | No |
| Routing Flexibility | Limited (partition key) | Rich (exchanges, bindings) | Limited |
| Operational Overhead | High (ZooKeeper/KRaft) | Medium | None (managed) |
| Use Case | Event streaming, logs, CDC | Task queues, complex routing | Simple cloud-native queues |
Choose Kafka when you need event replay, high throughput, or stream processing. Choose RabbitMQ when you need complex routing, priority queues, or request/reply patterns. Choose SQS when you want a managed solution and do not need message replay or complex routing.
Common Interview Questions
Q: How would you design a system to handle 100,000 order events per second?
A: Use Kafka with enough partitions (at least 100) to distribute the load. Partition by order ID for ordering guarantees per order. Use consumer groups with auto-scaling based on consumer lag. Ensure idempotent processing with a deduplication store.

Q: A consumer keeps failing on a specific message. How do you handle it?
A: Configure a dead letter queue with a max retry count (e.g., 3 attempts). Failed messages move to the DLQ for investigation. Set up alerts on DLQ depth. Implement a separate DLQ processor that can retry with manual intervention or route to an error-handling workflow.

Q: How do you ensure ordering in a distributed message queue?
A: In Kafka, ordering is guaranteed within a partition. Use a consistent partition key (like user ID or order ID) to ensure related messages go to the same partition. If you need global ordering, use a single partition but accept the throughput limitation.

Q: What happens to in-flight messages if a Kafka consumer crashes?
A: If using manual offset commits, uncommitted messages will be redelivered to another consumer in the group after the session timeout. This is why at-least-once delivery requires idempotent processing -- the same message may be processed twice during consumer rebalancing.
What's Next
Now that we understand asynchronous communication between services, Part 7: Database Sharding and Partitioning tackles how to split your data layer across multiple machines when a single database can no longer handle the load.
FAQ
When should I use a message queue instead of direct API calls?
Use message queues when the downstream service is slow, unreliable, or when you need to process tasks asynchronously. Examples include sending emails, processing payments, and generating reports.
What is the difference between Kafka and RabbitMQ?
Kafka is a distributed log optimized for high-throughput event streaming and replay. RabbitMQ is a traditional message broker with flexible routing, priority queues, and better support for complex messaging patterns.
How do I guarantee exactly-once message processing?
True exactly-once is nearly impossible in distributed systems. Instead, design for at-least-once delivery with idempotent consumers that safely handle duplicate messages using unique identifiers or deduplication windows.