System Design: Real-Time Chat System

Tags: system-design, architecture, websockets, real-time

Real-time chat is one of the hardest system design problems because it touches everything — persistent connections, message ordering, delivery guarantees, presence, multi-device sync, and the expectation that it all happens in under 200ms.

WhatsApp handles 100B+ messages per day. Slack manages millions of concurrent WebSocket connections. Let's design something in that ballpark.

requirements

Functional:

  • 1:1 messaging and group chats (up to 500 members)
  • Real-time message delivery
  • Sent/delivered/read receipts
  • Online/offline presence indicators
  • Message history and search
  • Multi-device sync
  • File and image sharing

Non-functional:

  • End-to-end message latency under 200ms
  • Messages must never be lost (at-least-once delivery)
  • Message ordering preserved within a conversation
  • Support 10M concurrent connections
  • Offline message delivery when user comes back online

high-level architecture

┌──────────┐        ┌───────────────────┐
│ Client A │◀──ws──▶│ WebSocket         │
└──────────┘        │ Gateway           │
                    │ (connection mgmt) │
┌──────────┐        │                   │
│ Client B │◀──ws──▶│ Routes msgs to    │
└──────────┘        │ correct server    │
                    └─────────┬─────────┘
                              │
                    ┌─────────▼─────────┐
                    │ Message Service   │
                    │ (ordering, store, │
                    │  routing)         │
                    └──┬──────┬──────┬──┘
                       │      │      │
              ┌────────▼─┐ ┌──▼───┐ ┌▼─────────┐
              │ Cassandra│ │Redis │ │ Presence │
              │ (message │ │(sess-│ │ Service  │
              │  store)  │ │ions) │ │          │
              └──────────┘ └──────┘ └──────────┘

Two critical paths: connection management (keeping millions of WebSockets alive) and message routing (getting a message from sender to recipient in milliseconds).

connection layer: WebSockets

HTTP polling is dead for real-time chat. Long polling is hacky. WebSockets give you a persistent, bidirectional, full-duplex connection. One TCP connection per client, kept alive with heartbeats.

Connection Flow:
┌────────┐                    ┌──────────────┐
│ Client │── HTTP Upgrade ───▶│ WS Gateway   │
│        │◀── 101 Switch ─────│ Server #3    │
│        │                    │              │
│        │◀── heartbeat ──────│ (ping/pong   │
│        │── heartbeat ──────▶│  every 30s)  │
│        │                    │              │
│        │── message ────────▶│ route to     │
│        │◀── message ────────│ recipient    │
└────────┘                    └──────────────┘

Connection assignment: A load balancer routes the initial WebSocket handshake. Once connected, the client stays on that gateway server. The session registry (Redis) maps user_id → gateway_server:connection_id.
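
A minimal sketch of that registry with redis-py; the key names and the 120-second TTL are illustrative choices, not part of any standard:

import redis

r = redis.Redis()

def register_connection(user_id: str, gateway_id: str, conn_id: str) -> None:
    # A set per user supports multiple devices; the TTL guards against
    # stale entries if a gateway dies without cleaning up. Refresh it
    # on every heartbeat.
    r.sadd(f"session:{user_id}", f"{gateway_id}:{conn_id}")
    r.expire(f"session:{user_id}", 120)

def lookup_connections(user_id: str) -> list[str]:
    # An empty result means the user has no live connection (offline).
    return [m.decode() for m in r.smembers(f"session:{user_id}")]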

10M concurrent connections: A single server can hold ~50k-100k WebSocket connections (mostly memory-bound). That's 100-200 gateway servers. Use consistent hashing to distribute users across servers. When a server goes down, clients reconnect and get assigned to a new one.
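
A sketch of that assignment with a basic consistent hash ring; the virtual-node count of 100 is an arbitrary tunable:

import bisect
import hashlib

class ConsistentHashRing:
    """Maps user_ids to gateway servers. Removing a server only remaps
    the keys that hashed to it, not the whole population."""

    def __init__(self, servers: list[str], vnodes: int = 100):
        # Each server appears vnodes times on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, user_id: str) -> str:
        # First ring position at or after the user's hash, wrapping around.
        idx = bisect.bisect(self.keys, self._hash(user_id)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([f"gw-{n}" for n in range(150)])
print(ring.server_for("user_42"))  # e.g. "gw-87"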

message flow

When User A sends a message to User B:

  1. A's client sends message over WebSocket to Gateway Server 3
  2. Gateway Server 3 publishes to Message Service
  3. Message Service generates a monotonic message ID (for ordering)
  4. Message persisted to Cassandra
  5. Message Service looks up B's connection: Redis says B is on Gateway Server 7
  6. Message routed to Gateway Server 7
  7. Gateway Server 7 pushes message to B's WebSocket
  8. B's client sends "delivered" receipt back through the same path

User A            GW-3            Msg Service       GW-7             User B
  │                 │                 │                 │                 │
  │── send msg ────▶│                 │                 │                 │
  │                 │── publish ─────▶│                 │                 │
  │                 │                 │── persist ──▶ DB                  │
  │                 │                 │── lookup B ──▶ Redis              │
  │                 │                 │── route ───────▶│                 │
  │                 │                 │                 │── deliver ─────▶│
  │                 │                 │                 │◀── ack ─────────│
  │◀── sent ack ────│◀── ack ─────────│◀────────────────│                 │

Total hops: 4 network calls. With co-located services, this completes in 50-100ms.
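
A sketch of steps 3 through 7 on the Message Service side, reusing the r client and lookup_connections from the registry sketch above; store_message, get_members, and route_to_gateway are assumed helpers, not real APIs:

def handle_send(sender_id: str, chat_id: str, content: str) -> None:
    seq = r.incr(f"chat:{chat_id}:seq")              # step 3: monotonic per-chat ID
    store_message(chat_id, seq, sender_id, content)  # step 4: persist to Cassandra
    for member_id in get_members(chat_id):
        if member_id == sender_id:
            continue
        for conn in lookup_connections(member_id):   # step 5: where is the recipient?
            gateway_id, conn_id = conn.split(":")    # e.g. "gw-7:c93af2"
            route_to_gateway(gateway_id, conn_id,    # steps 6-7: hop to the right
                             chat_id, seq, content)  # gateway, push over WebSocket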

message ordering

This is the subtle hard part. In a 1:1 chat, messages must appear in the order they were sent. But network delays mean Message 2 might arrive at the server before Message 1.

Solution: server-side sequencing. The Message Service assigns each message a monotonically increasing sequence number per conversation. Clients render messages by sequence number, not arrival time.

For group chats, use a per-conversation sequence counter in Redis: INCR chat:{chat_id}:seq. Redis's single-threaded command execution guarantees each caller gets a unique, monotonically increasing value. The counter becomes the total ordering.
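
On the client, rendering by sequence number means buffering anything that arrives out of order. A minimal sketch:

class ConversationView:
    """Client-side buffer that renders by sequence number, not arrival order."""

    def __init__(self):
        self.next_seq = 1                  # next sequence number we can render
        self.buffer: dict[int, str] = {}   # out-of-order messages parked here

    def on_message(self, seq: int, content: str) -> list[str]:
        self.buffer[seq] = content
        ready = []
        while self.next_seq in self.buffer:     # drain any contiguous run
            ready.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return ready   # a persistent gap means a missed message: re-fetch from server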

Use Lamport timestamps for cross-conversation ordering (e.g., sorting the chat list). Each message gets a Lamport timestamp = max(local_clock, last_received_timestamp) + 1. This gives you causal ordering without global synchronization.
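
That rule is only a few lines. A minimal Lamport clock:

class LamportClock:
    def __init__(self):
        self.clock = 0

    def tick(self) -> int:
        # Called before sending a message.
        self.clock += 1
        return self.clock

    def observe(self, received_ts: int) -> int:
        # Called on receiving a message: jump past anything we've seen.
        self.clock = max(self.clock, received_ts) + 1
        return self.clock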

presence system

"User is typing..." and online/offline indicators seem simple but scale poorly.

Online/offline: When a user connects, set a key in Redis with a TTL: SET presence:{user_id} online EX 60. The client sends heartbeats every 30 seconds, resetting the TTL. If the heartbeat stops (app closed, network drop), the key expires after 60 seconds and the user appears offline.
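
With redis-py the whole mechanism is two calls; the key name is illustrative:

import redis

r = redis.Redis()

def heartbeat(user_id: str) -> None:
    # Called every 30s while the WebSocket is alive. The key survives two
    # missed heartbeats before the user flips to offline.
    r.set(f"presence:{user_id}", "online", ex=60)

def is_online(user_id: str) -> bool:
    return r.exists(f"presence:{user_id}") == 1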

Typing indicators: The client sends a "typing" event when the user starts typing. The server fans this out to all members of the conversation. Add a debounce — send "typing" at most once every 3 seconds. Send "stopped typing" after 5 seconds of inactivity.
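
A client-side sketch of the 3-second throttle; send_event is a stand-in for whatever pushes events over the WebSocket, and the 5-second "stopped typing" timer would live in the client's scheduler:

import time

class TypingThrottle:
    # Emits "typing" at most once per interval, no matter how fast
    # the user types.
    def __init__(self, send_event, interval: float = 3.0):
        self.send_event = send_event
        self.interval = interval
        self.last_sent = 0.0

    def on_keystroke(self) -> None:
        now = time.monotonic()
        if now - self.last_sent >= self.interval:
            self.send_event("typing")
            self.last_sent = now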

Presence Architecture:
┌───────────┐     ┌───────────┐     ┌───────────┐
│ Client A  │────▶│ Presence  │────▶│ Redis     │
│ heartbeat │     │ Service   │     │ TTL keys  │
│ every 30s │     │           │     │ EX 60s    │
└───────────┘     └─────┬─────┘     └───────────┘
                        │
                        │ subscribe to friend list
                        ▼
                ┌───────────────┐
                │ Fan-out to    │
                │ online friends│
                │ via WebSocket │
                └───────────────┘

Scaling presence fan-out: If a user with 1,000 friends comes online, you need to notify up to 1,000 connections. For popular users (influencers), this creates a thundering herd. Solution: batch presence updates every 10 seconds instead of real-time. Nobody notices a 10-second delay on a green dot.
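
A sketch of that batching; notify_friends is an assumed fan-out helper, and single-threaded asyncio makes the copy-and-clear safe:

import asyncio

pending: set[str] = set()   # users whose status changed since the last flush

def on_status_change(user_id: str) -> None:
    pending.add(user_id)    # O(1) on the hot path; no fan-out here

async def flush_presence(notify_friends) -> None:
    # One fan-out pass every 10 seconds instead of one per status change.
    while True:
        await asyncio.sleep(10)
        batch = list(pending)
        pending.clear()
        for user_id in batch:
            await notify_friends(user_id)   # push new status to online friends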

read receipts

Three states: sent (server received), delivered (recipient's device received), read (recipient opened the conversation).

Track these with timestamps per message per user:

message_receipts:
┌────────────┬─────────┬──────────────┬──────────┐
│ message_id │ user_id │ delivered_at │ read_at  │
├────────────┼─────────┼──────────────┼──────────┤
│ msg_001    │ user_B  │ 14:30:01     │ 14:30:45 │
│ msg_001    │ user_C  │ 14:30:02     │ NULL     │
└────────────┴─────────┴──────────────┴──────────┘

In group chats, this table grows fast. Optimization: only store receipts for the last N messages. Beyond that, assume everything is read. WhatsApp does this — you can't see read receipts for messages older than a few days.

Batching read receipts: When a user opens a chat with 50 unread messages, don't send 50 individual "read" events. Send one: "user B read up to message_id msg_050". The server marks all messages up to that ID as read.
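
A sketch of that high-water-mark approach, reusing the redis client r and the per-conversation sequence counter from the earlier sketches; key shapes are illustrative:

def mark_read_up_to(chat_id: str, user_id: str, seq: int) -> None:
    # One high-water mark per (chat, user) replaces a row per message.
    key = f"read:{chat_id}:{user_id}"
    if seq > int(r.get(key) or 0):
        r.set(key, seq)

def unread_count(chat_id: str, user_id: str) -> int:
    latest = int(r.get(f"chat:{chat_id}:seq") or 0)
    return latest - int(r.get(f"read:{chat_id}:{user_id}") or 0)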

message storage

Cassandra is the right choice here. WhatsApp uses it. Discord used it (before switching to ScyllaDB, which is Cassandra-compatible). The partition key is the conversation ID, and the clustering key is the message sequence number. This gives you ordered reads within a conversation for free.

Cassandra Schema:
┌──────────────────────────────────────────┐
│  Table: messages                         │
│  Partition Key: conversation_id          │
│  Clustering Key: sequence_num (DESC)     │
│                                          │
│  Columns:                                │
│  - message_id (UUID)                     │
│  - sender_id                             │
│  - content (text)                        │
│  - type (text/image/file)                │
│  - created_at (timestamp)                │
│  - edited_at (timestamp, nullable)       │
└──────────────────────────────────────────┘

Pagination: Load the last 50 messages on chat open (one Cassandra partition read). Scroll up to load more. The clustering key ordering makes this a single sequential disk read.
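
With the DataStax Python driver, both page shapes are one query; the contact point and the "chat" keyspace are placeholders for this sketch:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("chat")

def load_history(conversation_id: str, before_seq: int | None = None, limit: int = 50):
    # First page: newest `limit` messages, served in clustering-key order.
    if before_seq is None:
        return session.execute(
            "SELECT * FROM messages WHERE conversation_id = %s LIMIT %s",
            (conversation_id, limit))
    # Scroll-up: everything below the oldest sequence number on screen.
    return session.execute(
        "SELECT * FROM messages WHERE conversation_id = %s "
        "AND sequence_num < %s LIMIT %s",
        (conversation_id, before_seq, limit))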

offline message delivery

When User B is offline and receives messages:

  1. Message Service detects B has no active connection (Redis lookup returns null)
  2. Messages are persisted to Cassandra normally
  3. A "pending" counter is incremented in Redis: INCR pending:{user_B}
  4. A push notification is sent via FCM/APNs to wake the device
  5. When B reconnects, the gateway queries the pending count and fetches unread messages

Multi-device sync: If B has a phone and a laptop, both need the message. The session registry tracks all active connections per user. Messages fan out to all active devices. A "last_read_seq" per device tracks sync state.
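
Putting delivery and the offline path together; lookup_connections, r, and route_to_gateway are the assumed helpers from the earlier sketches, and send_push is a stand-in for an FCM/APNs client:

def deliver(user_id: str, chat_id: str, seq: int, payload: dict) -> None:
    conns = lookup_connections(user_id)    # every active device for this user
    if not conns:
        r.incr(f"pending:{user_id}")       # offline: bump the pending counter
        send_push(user_id)                 # and wake the device via FCM/APNs
        return
    for conn in conns:                     # multi-device: fan out to all of them
        gateway_id, conn_id = conn.split(":")
        route_to_gateway(gateway_id, conn_id, chat_id, seq, payload)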

key tradeoffs

Push vs. pull for message delivery: Push (server sends to client via WebSocket) is the right choice for real-time chat. Pull (client polls) adds latency and wastes bandwidth. Hybrid: use push for active conversations, pull for catching up on old conversations.

Message encryption: End-to-end encryption (like Signal Protocol) means the server can't read messages. This makes server-side search impossible. You have to choose: encrypted messages with client-only search, or server-readable messages with full-text search. WhatsApp chose encryption. Slack chose searchability.

Group chat fan-out: A message to a 500-person group means 499 delivery operations. Two approaches:

  • Fan-out on write: Write a copy to each member's inbox (499 writes). Fast reads, expensive writes.
  • Fan-out on read: Write once, each member queries the group's message stream. Cheap writes, more complex reads.

For chat, fan-out on write wins for groups under 500. Above that (broadcast channels), switch to fan-out on read.
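
A sketch of the write path for fan-out on write, using a Redis sorted set per inbox; the key shape is illustrative, and a Cassandra inbox table works the same way:

def fan_out_on_write(chat_id: str, seq: int, members: list[str], sender_id: str) -> None:
    # One write per recipient: reads become a single lookup on the
    # recipient's own inbox, already sorted by sequence number.
    for member_id in members:
        if member_id != sender_id:
            r.zadd(f"inbox:{member_id}", {f"{chat_id}:{seq}": seq})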

what makes this hard in production

The design above handles the happy path. The hard part is everything else: network partitions where a message is persisted but the delivery ack is lost (duplicate messages), clients that reconnect mid-message (resume from last sequence number), message editing and deletion (propagate to all devices and caches), and the sheer operational burden of keeping 10M WebSocket connections alive across rolling deployments.

Start with a managed WebSocket service (AWS API Gateway WebSocket, Ably, Pusher) unless you have the team to run this infrastructure yourself.

