September 05, 2025
Last updated: September 05, 2025

Design a Real-Time Chat System Like WhatsApp

Design a scalable real-time chat system with WebSockets, message delivery guarantees, read receipts, group chats, and offline message synchronization strategies.

Tags

System Design, Chat System, WebSockets, Real-Time
12 min read

This post applies concepts from the System Design from Zero to Hero series.

TL;DR

A real-time chat system uses persistent WebSocket connections for instant messaging, a message queue for delivery guarantees, and sequence IDs for ordering and offline sync. The architecture centers around a connection gateway that manages millions of WebSocket sessions, a chat service that handles message routing and persistence, and a presence service that tracks online/offline status. Group chat introduces fan-out challenges that require careful trade-offs between write amplification and read latency.

Requirements

Functional Requirements

  1. One-on-one messaging — Users can send text messages to each other in real time.
  2. Group chat — Support group conversations with up to 500 members.
  3. Delivery receipts — Show sent, delivered, and read status for each message.
  4. Presence — Display online/offline status and "last seen" timestamps.
  5. Offline message sync — Users receive all missed messages when they come back online.
  6. Media sharing — Support images, videos, and documents (we will focus on the messaging layer, not the media upload pipeline).

Non-Functional Requirements

  1. Real-time delivery — Messages should arrive within 200ms for online users.
  2. Ordering guarantee — Messages within a conversation must appear in the correct order.
  3. Durability — No messages should be lost, even during server failures.
  4. Scale — Support 500 million active users with 50 million concurrent connections.
  5. High availability — The messaging service must be available 99.99% of the time.

Back-of-Envelope Estimation

Assume 500 million daily active users, each sending an average of 40 messages per day:

  • Total messages per day: 500M * 40 = 20 billion messages/day
  • Messages per second: 20B / 86,400 ≈ 230,000 messages/second (average)
  • Peak traffic: 3x average ≈ 700,000 messages/second
  • Concurrent connections: 50 million WebSocket connections
  • Per-server capacity: ~50,000 WebSocket connections per server → need ~1,000 gateway servers
  • Storage per message: ~200 bytes average → 20B * 200 bytes = ~4 TB/day
  • Storage over 5 years: ~7.3 PB (with hot/cold tiering, most data moves to cheap storage)
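These estimates are straightforward to sanity-check. A quick script (the constants simply restate the assumptions above):

```python
# Back-of-envelope check of the figures above.
DAU = 500_000_000              # daily active users
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 200
CONNS_PER_SERVER = 50_000
CONCURRENT_CONNECTIONS = 50_000_000

messages_per_day = DAU * MSGS_PER_USER_PER_DAY                    # 20 billion
avg_msgs_per_sec = messages_per_day / 86_400                      # ~231K
peak_msgs_per_sec = avg_msgs_per_sec * 3                          # ~694K
gateway_servers = CONCURRENT_CONNECTIONS // CONNS_PER_SERVER      # 1,000
storage_per_day_tb = messages_per_day * BYTES_PER_MESSAGE / 1e12  # 4 TB/day
storage_5y_pb = storage_per_day_tb * 365 * 5 / 1000               # ~7.3 PB

print(f"{avg_msgs_per_sec:,.0f} msg/s avg, {peak_msgs_per_sec:,.0f} msg/s peak")
print(f"{gateway_servers} gateways, {storage_per_day_tb:.1f} TB/day, {storage_5y_pb:.2f} PB over 5 years")
```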

High-Level Design

Clients ↔ WebSocket Gateway Cluster ↔ Chat Service ↔ Message Store
                 ↕                        ↕
          Connection Registry        Message Queue (Kafka)
                 ↕                        ↕
          Presence Service          Notification Service

Message send flow:

  1. User A's client sends a message through its WebSocket connection to a gateway server.
  2. The gateway forwards the message to the Chat Service.
  3. The Chat Service persists the message to the Message Store with a sequence number.
  4. The Chat Service looks up User B's gateway server in the Connection Registry.
  5. If User B is online, the message is forwarded to their gateway server and pushed down their WebSocket.
  6. If User B is offline, the message is queued for delivery when they reconnect.
  7. An analytics/notification event is published to Kafka for push notification delivery.
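The deliver-or-queue decision in steps 4–6 can be sketched as follows. This is a minimal illustration: plain dicts stand in for the Connection Registry, the offline queue, and the gateway RPC, and names like `push_to_gateway` are hypothetical, not from the post.

```python
from collections import defaultdict

connection_registry: dict[str, str] = {}              # user_id -> gateway_server_id
offline_queues: dict[str, list] = defaultdict(list)   # user_id -> pending messages
pushed: list[tuple[str, str, dict]] = []              # (gateway, user, message)

def push_to_gateway(gateway_id: str, user_id: str, message: dict) -> None:
    # Stand-in for an RPC to the gateway holding the recipient's WebSocket.
    pushed.append((gateway_id, user_id, message))

def route_message(recipient_id: str, message: dict) -> str:
    gateway_id = connection_registry.get(recipient_id)
    if gateway_id is not None:
        push_to_gateway(gateway_id, recipient_id, message)  # step 5: online push
        return "pushed"
    offline_queues[recipient_id].append(message)            # step 6: queue it
    return "queued"

# User B is online on gateway-7; User C is offline.
connection_registry["user_b"] = "gateway-7"
assert route_message("user_b", {"content": "hi"}) == "pushed"
assert route_message("user_c", {"content": "hello"}) == "queued"
```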

Detailed Design

WebSocket Connection Management

WebSockets provide full-duplex communication, which is essential for chat. Unlike HTTP polling or Server-Sent Events, WebSockets let both client and server push data at any time over a single persistent TCP connection. For more on connection management at scale, see Part 3: Load Balancing.

Connection Gateway Architecture:

Each gateway server manages tens of thousands of WebSocket connections. When a user connects:

  1. The client opens a WebSocket to a gateway server (assigned by the load balancer).
  2. The gateway authenticates the connection using a JWT.
  3. The gateway registers the mapping user_id → gateway_server_id in a distributed Connection Registry (Redis cluster).
  4. The gateway starts a heartbeat ping/pong cycle (every 30 seconds) to detect dead connections.
  5. On disconnect, the gateway removes the user from the Connection Registry and updates the Presence Service.
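The connection lifecycle above can be sketched as two handlers. In-memory dicts stand in for the Redis Connection Registry and the Presence Service, and `GATEWAY_ID` is a hypothetical identifier for this gateway instance:

```python
import time

# In-memory stand-ins for the Redis Connection Registry and Presence Service.
connection_registry: dict[str, str] = {}   # user_id -> gateway_server_id
presence: dict[str, dict] = {}             # user_id -> {"status", "last_seen"}

GATEWAY_ID = "gateway-42"  # hypothetical ID of this gateway instance

def on_connect(user_id: str) -> None:
    # Step 3: register the user -> gateway mapping so the Chat Service
    # can route messages to the correct server.
    connection_registry[user_id] = GATEWAY_ID
    presence[user_id] = {"status": "online", "last_seen": time.time()}

def on_disconnect(user_id: str) -> None:
    # Step 5: remove the mapping and record last_seen for presence.
    connection_registry.pop(user_id, None)
    presence[user_id] = {"status": "offline", "last_seen": time.time()}

on_connect("user_a")
assert connection_registry["user_a"] == "gateway-42"
on_disconnect("user_a")
assert "user_a" not in connection_registry
```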

Load balancing considerations: Use Layer 4 (TCP) load balancing for WebSocket connections, not Layer 7 (HTTP). Layer 7 load balancers can interfere with the WebSocket upgrade handshake. Sticky sessions are not required because the Connection Registry decouples message routing from connection assignment.

Message Ordering with Sequence Numbers

Message ordering is deceptively difficult in a distributed system. Wall-clock timestamps are unreliable across servers and devices. Instead, use per-conversation monotonically increasing sequence numbers.

How it works:

  1. Each conversation maintains a last_sequence_id counter.
  2. When a message is sent, the Chat Service atomically increments the counter and assigns the new value to the message.
  3. Clients display messages sorted by sequence number within each conversation.
  4. During offline sync, the client sends its last_seen_sequence_id and the server returns all messages with a higher sequence number.
```python
# Atomic sequence assignment (pseudocode)
def send_message(conversation_id: str, sender_id: str, content: str):
    # Atomic increment: use Redis INCR or a database row-level lock
    seq_id = redis.incr(f"conv:{conversation_id}:seq")

    message = {
        "id": generate_uuid(),
        "conversation_id": conversation_id,
        "sender_id": sender_id,
        "content": content,
        "sequence_id": seq_id,
        "timestamp": now(),
        "status": "SENT",
    }

    message_store.insert(message)
    route_to_recipients(conversation_id, message)
    return message
```

For high-throughput conversations (busy group chats), a single Redis key per conversation becomes a bottleneck. In that case, use a database-level auto-increment within a partitioned message table, where each conversation maps to a single partition.
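The offline-sync half of this scheme (step 4 above) reduces to a range query over sequence numbers. A minimal sketch, with a plain list standing in for the partitioned messages table:

```python
def sync_missed_messages(message_store: list[dict],
                         conversation_id: str,
                         last_seen_sequence_id: int) -> list[dict]:
    """Return all messages the client missed, in order.

    `message_store` is a plain list standing in for a range query against the
    messages table (partition key = conversation_id, clustering = sequence_id).
    """
    missed = [
        m for m in message_store
        if m["conversation_id"] == conversation_id
        and m["sequence_id"] > last_seen_sequence_id
    ]
    return sorted(missed, key=lambda m: m["sequence_id"])

store = [
    {"conversation_id": "c1", "sequence_id": s, "content": f"msg {s}"}
    for s in (1, 2, 3, 4, 5)
]
# Client reconnects having seen up to sequence 3, so it receives 4 and 5.
assert [m["sequence_id"] for m in sync_missed_messages(store, "c1", 3)] == [4, 5]
```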

Delivery Receipts: Sent, Delivered, Read

The three-tick system (sent → delivered → read) requires tracking message status transitions:

  • Sent — The server has persisted the message. The server sends an acknowledgment back to the sender's client.
  • Delivered — The recipient's device has received the message. The recipient's client sends a delivery acknowledgment back to the server, which forwards it to the sender.
  • Read — The recipient has opened the conversation. The recipient's client sends a read receipt when the message becomes visible on screen.

Sender → Server: "Here is a message"
Server → Sender: "ACK: message stored" (✓ sent)
Server → Recipient: "New message for you"
Recipient → Server: "ACK: received" (✓✓ delivered)
Recipient → Server: "Read up to seq_id 42" (✓✓ read, blue ticks)
Server → Sender: "Recipient read up to seq_id 42"

Optimization for read receipts: Instead of sending a read receipt per message, the client sends "I have read all messages up to sequence_id X in conversation Y." This single event marks all prior messages as read, reducing network traffic dramatically.
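This batched receipt is naturally modeled as a per-user, per-conversation read watermark. A minimal sketch (the watermark dict stands in for the `last_read_seq_id` column in the User Conversations table; one detail worth noting is that receipts can arrive out of order, so the watermark should never move backwards):

```python
def mark_read_up_to(last_read: dict[tuple[str, str], int],
                    user_id: str, conversation_id: str, seq_id: int) -> int:
    """Apply a batched read receipt: one event marks everything up to seq_id."""
    key = (user_id, conversation_id)
    # Receipts may arrive out of order; never move the watermark backwards.
    last_read[key] = max(last_read.get(key, 0), seq_id)
    return last_read[key]

def is_read(last_read: dict[tuple[str, str], int],
            user_id: str, conversation_id: str, message_seq_id: int) -> bool:
    return message_seq_id <= last_read.get((user_id, conversation_id), 0)

watermarks: dict[tuple[str, str], int] = {}
mark_read_up_to(watermarks, "user_b", "conv_1", 42)
assert is_read(watermarks, "user_b", "conv_1", 40)       # covered by the batch
assert not is_read(watermarks, "user_b", "conv_1", 43)   # newer than the receipt
mark_read_up_to(watermarks, "user_b", "conv_1", 10)      # stale receipt, ignored
assert watermarks[("user_b", "conv_1")] == 42
```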

Presence Service: Online/Offline/Last Seen

Tracking presence for hundreds of millions of users is expensive if done naively. A heartbeat-based approach works well:

  1. When a user connects, set their status to online in a Redis hash with a TTL of 60 seconds.
  2. Every 30 seconds, the client sends a heartbeat. The server refreshes the TTL.
  3. If the TTL expires (no heartbeat for 60 seconds), the user is considered offline. Their last_seen timestamp is recorded.
  4. When a user opens a conversation, the client fetches the presence status of the other participant(s).

Fan-out of presence updates: Do not broadcast presence changes to all contacts. That creates a thundering herd problem when popular users go online/offline. Instead, use a pull model: when User A opens a chat with User B, User A's client queries User B's presence. For the friend list screen, batch-fetch presence for visible contacts only.
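The heartbeat-with-TTL and pull-model ideas combine into a small presence store. A sketch, with an in-memory dict standing in for the Redis hashes and an explicit `now` parameter instead of the wall clock so the expiry logic is easy to follow:

```python
class PresenceStore:
    """TTL-based presence, standing in for the Redis hashes described above."""

    TTL = 60.0  # seconds without a heartbeat before a user counts as offline

    def __init__(self) -> None:
        self._heartbeats: dict[str, float] = {}  # user_id -> last heartbeat time

    def heartbeat(self, user_id: str, now: float) -> None:
        self._heartbeats[user_id] = now  # step 2: refresh the TTL

    def get(self, user_id: str, now: float) -> dict:
        last = self._heartbeats.get(user_id)
        if last is not None and now - last < self.TTL:
            return {"status": "online"}
        # TTL expired (or never connected): report offline with last_seen.
        return {"status": "offline", "last_seen": last}

    def batch_get(self, user_ids: list[str], now: float) -> dict[str, dict]:
        # Pull model: fetch presence only for the contacts currently on screen.
        return {u: self.get(u, now) for u in user_ids}

store = PresenceStore()
store.heartbeat("user_b", now=1000.0)
assert store.get("user_b", now=1030.0)["status"] == "online"   # within TTL
assert store.get("user_b", now=1070.0)["status"] == "offline"  # TTL expired
```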

For the pub/sub mechanisms that power real-time presence subscriptions, see Part 6: Message Queues and Event-Driven Architecture.

Group Chat Fan-Out

Group chat introduces a fan-out problem: when a user sends a message to a group of 200 members, how do you deliver it to all 200?

Fan-out on write:

When a message arrives, the server writes a copy to each recipient's inbox. This pre-computes delivery, making reads fast (each user just reads their own inbox). The downside is write amplification: a single message to a 200-member group creates 200 writes.

Fan-out on read:

The message is stored once in the group's message table. When a user opens the group chat, they query the group's messages directly. Reads are slightly slower (fetching from a shared table), but writes are minimal.

Hybrid approach (recommended):

  • For small groups (under 50 members), use fan-out on write. The write amplification is manageable, and reads are instant.
  • For large groups (50+ members), use fan-out on read. Store messages once in the group table and let clients query it.
  • For channels/broadcast groups (thousands of members), always fan-out on read with aggressive caching.
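The hybrid rule above amounts to a size-based dispatch at delivery time. A minimal sketch, with dicts standing in for per-user inboxes and the shared group timeline (the 50-member threshold is the one stated above):

```python
from collections import defaultdict

SMALL_GROUP_LIMIT = 50  # threshold from the hybrid rule above

inboxes: dict[str, list] = defaultdict(list)           # fan-out-on-write target
group_timelines: dict[str, list] = defaultdict(list)   # fan-out-on-read target

def deliver_group_message(group_id: str, members: list[str],
                          sender_id: str, message: dict) -> str:
    if len(members) < SMALL_GROUP_LIMIT:
        # Fan-out on write: copy the message into every recipient's inbox.
        for member in members:
            if member != sender_id:
                inboxes[member].append(message)
        return "fan_out_on_write"
    # Fan-out on read: store once; clients query the group timeline directly.
    group_timelines[group_id].append(message)
    return "fan_out_on_read"

small = [f"u{i}" for i in range(10)]
large = [f"u{i}" for i in range(200)]
assert deliver_group_message("g1", small, "u0", {"seq": 1}) == "fan_out_on_write"
assert deliver_group_message("g2", large, "u0", {"seq": 1}) == "fan_out_on_read"
assert len(inboxes["u1"]) == 1 and len(group_timelines["g2"]) == 1
```

Note the write amplification trade-off made concrete: the 10-member group produced 9 inbox writes, while the 200-member group produced exactly one timeline write.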

Message Storage: Hot/Cold Tiering

Chat history follows a clear access pattern: recent messages are accessed frequently, while older messages are rarely read. This is a textbook case for hot/cold storage tiering.

  • Hot storage (last 30 days): Store in a fast database like Cassandra or DynamoDB, partitioned by conversation_id. This is the primary read path.
  • Warm storage (30 days to 1 year): Move to a cheaper storage tier, still queryable but with higher latency. Compressed and stored on SSDs.
  • Cold storage (1+ years): Archive to object storage (S3). Only loaded on demand when a user scrolls back through very old messages.

A background job periodically migrates messages from hot to warm to cold storage based on age.
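The migration job's core decision is a pure function of message age. A sketch of that classification under the 30-day/1-year policy above:

```python
from datetime import datetime, timedelta, timezone

def storage_tier(created_at: datetime, now: datetime) -> str:
    """Pick the storage tier for a message by age, per the policy above."""
    age = now - created_at
    if age <= timedelta(days=30):
        return "hot"    # Cassandra/DynamoDB, primary read path
    if age <= timedelta(days=365):
        return "warm"   # cheaper tier, compressed, higher latency
    return "cold"       # object storage (e.g. S3), loaded on demand

now = datetime(2025, 9, 5, tzinfo=timezone.utc)
assert storage_tier(now - timedelta(days=5), now) == "hot"
assert storage_tier(now - timedelta(days=90), now) == "warm"
assert storage_tier(now - timedelta(days=800), now) == "cold"
```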

End-to-End Encryption Concepts

In end-to-end encryption (E2E), the server never sees plaintext messages. The basic flow:

  1. Each user generates a public/private key pair on their device.
  2. Public keys are uploaded to a Key Distribution Server.
  3. When User A messages User B, User A's client fetches User B's public key, encrypts the message, and sends the ciphertext to the server.
  4. The server stores and forwards the ciphertext without being able to decrypt it.
  5. User B's client decrypts with their private key.

Group E2E encryption is more complex. The Signal Protocol uses a "sender key" approach where each group member generates a sender key, distributes it to all other members, and encrypts group messages with that key. Key rotation happens when members leave the group.

Note that E2E encryption means the server cannot perform server-side search on message content, spam filtering on encrypted messages, or link previews. These features must happen on the client device.

Data Model

Messages Table (Cassandra — partitioned by conversation_id)

| Column | Type | Description |
| --- | --- | --- |
| conversation_id | UUID | Partition key |
| sequence_id | BIGINT | Clustering key (ascending) |
| message_id | UUID | Globally unique message ID |
| sender_id | UUID | User who sent the message |
| content | TEXT | Message body (encrypted if E2E) |
| message_type | VARCHAR | text / image / video / document |
| status | VARCHAR | sent / delivered / read |
| created_at | TIMESTAMP | Server-side timestamp |

Conversations Table

| Column | Type | Description |
| --- | --- | --- |
| conversation_id | UUID | Primary key |
| type | VARCHAR | one_on_one / group |
| name | VARCHAR | Group name (null for 1:1) |
| created_by | UUID | User who created the conversation |
| created_at | TIMESTAMP | Creation time |
| last_sequence_id | BIGINT | Latest sequence number |
| last_message_at | TIMESTAMP | Used for sorting conversation list |

User Conversations Table (for each user's conversation list)

| Column | Type | Description |
| --- | --- | --- |
| user_id | UUID | Partition key |
| conversation_id | UUID | Clustering key |
| last_read_seq_id | BIGINT | Last sequence ID the user has read |
| unread_count | INT | Number of unread messages |
| muted_until | TIMESTAMP | Mute expiration |

Presence Table (Redis Hash)

Key: presence:{user_id}
Fields:
  status: "online" | "offline"
  last_seen: timestamp
TTL: 60 seconds (auto-expire to offline)

Scaling Considerations

WebSocket gateway scaling: Add more gateway servers behind a TCP load balancer. Each server handles ~50K connections. The Connection Registry (Redis) decouples message routing from server assignment, so any gateway can be added or removed without disrupting other connections.

Chat Service scaling: The Chat Service is stateless. Scale horizontally behind an internal load balancer. The only coordination point is the sequence number counter, which is partitioned per conversation.

Message Store scaling: Cassandra scales linearly by adding nodes. Partition by conversation_id so all messages in a conversation live on the same node, enabling efficient range queries (fetching the last 50 messages).

Cross-region deployment: For a global user base, deploy gateway servers in multiple regions. Use a geo-DNS service to route users to the nearest region. Messages between users in different regions are routed through an inter-region message bus.

Trade-offs and Alternatives

| Decision | Option A | Option B | Recommendation |
| --- | --- | --- | --- |
| Transport | WebSocket | Long polling | WebSocket for full-duplex real-time |
| Ordering | Timestamps | Sequence IDs | Sequence IDs for guaranteed ordering |
| Group fan-out | Fan-out on write | Fan-out on read | Hybrid based on group size |
| Message store | Cassandra | DynamoDB | Cassandra for its excellent partition-based range queries |
| Presence | Push model | Pull model | Pull to avoid thundering herd |
| Encryption | Server-side | End-to-end | E2E for privacy, server-side if search needed |

Why not HTTP long polling? Long polling requires the client to repeatedly open new HTTP connections. Each connection consumes a file descriptor, adds TLS handshake overhead, and introduces latency between the poll timeout and the next request. WebSockets maintain a single persistent connection with sub-millisecond delivery latency.

Why Cassandra over a relational database? Chat messages have a write-heavy append pattern and read access is almost exclusively by conversation (partition key). Cassandra's log-structured storage engine and partition-based data model are a natural fit. A relational database would require sharding to achieve the same write throughput, adding operational complexity.

FAQ

How do chat applications maintain millions of WebSocket connections?

Each server handles tens of thousands of connections. A connection gateway layer routes messages between users on different servers, often using an in-memory pub/sub system like Redis to locate which server holds each user's connection. The gateway servers are stateless in the sense that they hold only transient connection state. The Connection Registry (a Redis cluster) maintains the mapping of user IDs to gateway servers, enabling any service to find where a user is connected and route a message to the correct gateway.

How does WhatsApp guarantee message delivery when users are offline?

Messages are stored persistently on the server with delivery status tracking. When the recipient comes online, pending messages are pushed in order using sequence numbers. The client acknowledges receipt to update delivery status. The key mechanism is the per-conversation sequence number: the client reports its last_seen_sequence_id upon reconnection, and the server sends all messages with higher sequence numbers. This ensures no messages are missed even if the user was offline for days.

Should I use WebSockets or Server-Sent Events for a chat system?

WebSockets are better for chat because they provide full-duplex communication, allowing both sending and receiving on one connection. SSE is unidirectional and would require a separate channel for sending messages. With SSE, the client would need to use regular HTTP POST requests to send messages while receiving them through the SSE stream. This creates two separate connections and adds complexity to the protocol. WebSockets give you a single bidirectional channel that maps naturally to the chat use case.

Article Author

Sadam Hussain

Senior Full Stack Developer

Senior Full Stack Developer with over 7 years of experience building React, Next.js, Node.js, TypeScript, and AI-powered web platforms.
