System Design · February 14, 2026 · 7 min read

System Design: Notification System

system-design · architecture · messaging · queues

Notifications are the nervous system of every modern application. Slack sends billions per day. Uber sends a push notification the moment your driver arrives. Your bank sends an SMS when a large transaction hits. Behind each of these is a system that needs to be fast, reliable, and smart enough not to annoy users into disabling everything.

Let's design a notification system that handles 1B+ notifications per day across push, email, and SMS.

requirements

Functional:

Support push notifications (iOS/Android), email, SMS, and in-app
Template engine for dynamic content (e.g., "Your order shipped")
User preference management (opt-in/out per channel, quiet hours)
Priority levels (critical, high, normal, low)
Delivery tracking and retry logic

Non-functional:

Critical notifications (2FA codes, security alerts) delivered within 5 seconds
Normal notifications within 60 seconds
At-least-once delivery guarantee
Handle 10k+ notifications per second sustained
Rate limiting to prevent notification fatigue
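
A quick sanity check on those numbers: the sketch below converts a daily volume into an average per-second rate. It only computes averages; real traffic is bursty, so peak rates will be higher, and how much higher is something you'd have to measure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(notifications_per_day: int) -> float:
    """Average send rate implied by a daily volume (ignores peaks)."""
    return notifications_per_day / SECONDS_PER_DAY


for label, per_day in [
    ("10k/day", 10_000),
    ("10M/day", 10_000_000),
    ("1B/day", 1_000_000_000),
]:
    print(f"{label}: {average_per_second(per_day):,.1f} per second on average")
```

At 1B per day the average works out to roughly 11,574 per second, which is why the sustained target sits at 10k+ per second.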

high-level architecture

┌──────────────┐     ┌──────────────┐     ┌───────────────────┐
│  Services     │────▶│  Notification │────▶│  Priority Router   │
│  (triggers)   │     │  API          │     │                   │
│  - Orders     │     │  - validate   │     │  ┌─── critical ──▶ │
│  - Payments   │     │  - template   │     │  ├─── high ──────▶ │
│  - Auth       │     │  - enrich     │     │  ├─── normal ────▶ │
└──────────────┘     └──────────────┘     │  └─── low ────────▶ │
                                           └─────────┬─────────┘
                                                     │
                      ┌──────────────────────────────┼──────────┐
                      │            Kafka              │          │
                      │  ┌────────┐ ┌────────┐ ┌────────┐      │
                      │  │ push   │ │ email  │ │  sms   │      │
                      │  │ topic  │ │ topic  │ │ topic  │      │
                      │  └───┬────┘ └───┬────┘ └───┬────┘      │
                      └──────┼──────────┼──────────┼───────────┘
                             │          │          │
                      ┌──────▼───┐ ┌────▼────┐ ┌──▼──────┐
                      │  Push    │ │  Email   │ │  SMS    │
                      │  Worker  │ │  Worker  │ │  Worker │
                      │  (FCM/   │ │  (SES/   │ │ (Twilio/│
                      │   APNs)  │ │  Sendgrid)│ │  SNS)  │
                      └──────────┘ └──────────┘ └─────────┘

The key insight: decouple the notification request from the delivery. Services don't know or care how a notification gets delivered. They fire an event, and the notification system figures out the rest.

the notification API

Every notification request includes: who (user ID), what (template + data), and why (event type). The system figures out the how (which channels).

POST /api/notifications
{
  "user_id": "u_12345",
  "template": "order_shipped",
  "data": {
    "order_id": "ORD-789",
    "tracking_url": "https://track.example.com/789",
    "eta": "2026-02-16"
  },
  "priority": "normal",
  "event_type": "order.shipped"
}

The API layer handles:

Validation — is this a real user, real template?
Preference lookup — does the user want this notification? On which channels?
Template rendering — merge data into the template
Channel routing — send to push, email, SMS, or all three
Deduplication — don't send the same notification twice in 5 minutes
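
The deduplication step can be sketched as a fingerprint plus a time-limited "seen" set. This is a minimal in-memory version under my own assumptions: the choice of fields in the fingerprint and the `Deduplicator` class are hypothetical, and in production the set would live in a shared store (for example Redis with `SET key value NX EX 300`) so every API instance sees the same state.

```python
import hashlib
import json
import time

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute window from the list above


def dedup_key(request: dict) -> str:
    """Fingerprint the fields that define 'the same notification' (assumed set)."""
    identity = {
        "user_id": request["user_id"],
        "template": request["template"],
        "event_type": request["event_type"],
        "data": request.get("data", {}),
    }
    # sort_keys makes the fingerprint independent of key order in the payload
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduplicator:
    """In-memory stand-in for a shared store with per-key expiry."""

    def __init__(self, window_seconds=DEDUP_WINDOW_SECONDS, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._seen = {}  # fingerprint -> expiry timestamp

    def should_send(self, request: dict) -> bool:
        now = self._clock()
        # Forget fingerprints whose window has passed.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = dedup_key(request)
        if key in self._seen:
            return False
        self._seen[key] = now + self._window
        return True
```

Because delivery is at-least-once, this check reduces duplicates rather than eliminating them; a retry that lands after the window closes will still go out.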

template engine

Hard-coding notification text in every service is a maintenance nightmare. Templates solve this.

Template: order_shipped
├── push:  "Your order {{order_id}} is on the way!"
├── email: Full HTML template with tracking button
├── sms:   "Order {{order_id}} shipped. Track: {{tracking_url}}"
└── in-app: "Order shipped — arriving {{eta}}"

Store templates in a database. Marketing and product teams edit them through an admin UI. Engineers never touch notification copy after initial setup.

Template versioning matters. When you update a template, in-flight notifications should use the version they were created with, not the latest. Stamp each notification with the template version at creation time.
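
Here is a small sketch of version stamping, assuming an in-memory dictionary in place of the template database. The names `publish`, `create_notification`, and `render` are made up for illustration; the `{{name}}` placeholder syntax is the one used in the template above.

```python
import re
from dataclasses import dataclass

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

TEMPLATES = {}  # (template name, version) -> {channel: body}
LATEST = {}     # template name -> newest version number


def publish(name: str, bodies: dict) -> int:
    """Store a new immutable version of a template and return its number."""
    version = LATEST.get(name, 0) + 1
    TEMPLATES[(name, version)] = dict(bodies)
    LATEST[name] = version
    return version


@dataclass(frozen=True)
class Notification:
    template: str
    template_version: int  # stamped once, at creation time
    data: dict


def create_notification(template: str, data: dict) -> Notification:
    return Notification(template, LATEST[template], dict(data))


def render(notification: Notification, channel: str) -> str:
    """Render with the stamped version, not whatever is latest right now."""
    key = (notification.template, notification.template_version)
    body = TEMPLATES[key][channel]

    def substitute(match):
        name = match.group(1)
        if name not in notification.data:
            raise KeyError(f"missing template variable: {name}")
        return str(notification.data[name])

    return PLACEHOLDER.sub(substitute, body)
```

A notification created before a copy change keeps rendering the old text, even if it sits in a retry queue for an hour. Failing loudly on a missing variable is a design choice here; it beats sending a user a literal `{{order_id}}`.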

priority queues

Not all notifications are equal. A 2FA code needs to arrive in seconds. A marketing email can wait minutes.

Priority Levels:
┌──────────┬────────────┬───────────────────────────┐
│ Priority │ Target SLA │ Examples                  │
├──────────┼────────────┼───────────────────────────┤
│ Critical │ < 5s       │ 2FA codes, security       │
│          │            │ alerts, password reset    │
│ High     │ < 30s      │ Payment confirmation,     │
│          │            │ order updates             │
│ Normal   │ < 60s      │ Social interactions,      │
│          │            │ shipping updates          │
│ Low      │ < 5min     │ Marketing, weekly digest, │
│          │            │ recommendations           │
└──────────┴────────────┴───────────────────────────┘

Implementation: give each priority its own Kafka topic within every channel (for example, a critical push topic alongside a normal push topic), each with its own consumer group configuration. Critical topics get more consumers and are tuned for low fetch latency. Low-priority topics batch messages for efficiency.
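
One way to write that split down, as a sketch: a naming scheme with one topic per channel and priority, plus a tuning plan per priority. The topic names, consumer counts, and values are all illustrative assumptions. The setting names (`fetch.min.bytes`, `fetch.max.wait.ms`, `max.poll.records`) are standard Kafka consumer configuration keys.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Priority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    NORMAL = "normal"
    LOW = "low"


def topic_for(channel: str, priority: Priority) -> str:
    """Hypothetical naming scheme: one topic per (channel, priority) pair."""
    return f"notifications.{channel}.{priority.value}"


@dataclass(frozen=True)
class ConsumerPlan:
    consumers: int                # workers in the consumer group (illustrative)
    kafka_config: Dict[str, int]  # Kafka consumer settings (illustrative values)


PLANS = {
    # Critical: return fetches almost immediately, keep batches small.
    Priority.CRITICAL: ConsumerPlan(
        12, {"fetch.min.bytes": 1, "fetch.max.wait.ms": 10, "max.poll.records": 50}
    ),
    Priority.HIGH: ConsumerPlan(
        8, {"fetch.min.bytes": 1, "fetch.max.wait.ms": 100, "max.poll.records": 100}
    ),
    Priority.NORMAL: ConsumerPlan(
        4, {"fetch.min.bytes": 16384, "fetch.max.wait.ms": 250, "max.poll.records": 500}
    ),
    # Low: let the broker accumulate data so each fetch carries a large batch.
    Priority.LOW: ConsumerPlan(
        2, {"fetch.min.bytes": 65536, "fetch.max.wait.ms": 500, "max.poll.records": 1000}
    ),
}
```

For example, `topic_for("push", Priority.CRITICAL)` gives `notifications.push.critical`. Treat the numbers as starting points to load-test, not recommendations.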

delivery guarantees and retry logic

Notifications use at-least-once delivery. It's better to send a duplicate than to silently drop a 2FA code.

Retry Strategy (per channel):
┌─────────────────────────────────────────────┐
│  Attempt 1 → immediate                      │
│  Attempt 2 → after 30 seconds               │
│  Attempt 3 → after 2 minutes                │
│  Attempt 4 → after 15 minutes               │
│  Attempt 5 → after 1 hour                   │
│  Failed → Dead Letter Queue (DLQ)           │
│                                              │
│  Exponential backoff with jitter             │
│  Max retries vary by priority:               │
│    Critical: 10 retries over 4 hours        │
│    Normal:   5 retries over 2 hours         │
│    Low:      3 retries over 30 minutes      │
└─────────────────────────────────────────────┘

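Each delivery attempt logs the result, and the error type decides what happens next. If FCM reports that the device token is no longer registered (NotRegistered in the legacy API, UNREGISTERED in the HTTP v1 API), don't retry; remove the device token. If SendGrid returns a 5xx, retry with backoff. If Twilio says the number is invalid, mark it and alert the user to update their phone number.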

The dead letter queue catches everything that exhausts retries. An ops dashboard monitors DLQ depth. A spike means something is broken — provider outage, bad template, token rotation needed.
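
The retry rules above can be sketched in a few lines. This version uses the fixed five-attempt schedule from the box and adds jitter on top; the ±20% jitter range is my assumption, and the error labels (`not_registered`, `invalid_number`, `server_error`) are hypothetical normalized names, not the providers' actual response payloads.

```python
import random
from enum import Enum
from typing import Optional

# Base delay before attempts 1..5, taken from the schedule above.
BASE_DELAYS_SECONDS = [0, 30, 120, 900, 3600]


class Action(Enum):
    RETRY = "retry"
    DROP_TOKEN = "drop_token"
    FLAG_NUMBER = "flag_number"
    DEAD_LETTER = "dead_letter"


def delay_before_attempt(
    attempt: int, rng: random.Random, jitter: float = 0.2
) -> Optional[float]:
    """Seconds to wait before a 1-based attempt, or None once retries run out."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt > len(BASE_DELAYS_SECONDS):
        return None
    base = BASE_DELAYS_SECONDS[attempt - 1]
    # Jitter spreads retries out so failed sends don't all come back at once.
    return base * rng.uniform(1 - jitter, 1 + jitter)


# Permanent failures: retrying cannot help, so take a corrective action instead.
PERMANENT_ERRORS = {
    ("push", "not_registered"): Action.DROP_TOKEN,
    ("sms", "invalid_number"): Action.FLAG_NUMBER,
}


def next_action(channel: str, error: str, attempt: int) -> Action:
    """Decide what to do after attempt number `attempt` failed with `error`."""
    permanent = PERMANENT_ERRORS.get((channel, error))
    if permanent is not None:
        return permanent
    if attempt >= len(BASE_DELAYS_SECONDS):
        return Action.DEAD_LETTER
    return Action.RETRY
```

Per-priority retry budgets would replace the single `BASE_DELAYS_SECONDS` list with one schedule per priority; I've left that out to keep the sketch short.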

user preferences

This is where most notification systems fail. Users get spammed, disable everything, and you've lost the channel forever.

Preference Model:
┌────────────────────────────────────────────┐
│  User: u_12345                             │
│  ├── Global: notifications ON              │
│  ├── Quiet Hours: 10pm - 8am (PST)        │
│  ├── Channels:                             │
│  │   ├── push: ON                          │
│  │   ├── email: ON                         │
│  │   └── sms: CRITICAL ONLY               │
│  └── Categories:                           │
│      ├── order_updates: push + email       │
│      ├── marketing: email only             │
│      ├── security: all channels            │
│      └── social: push only                 │
└────────────────────────────────────────────┘

Security notifications bypass preferences — you always send a 2FA code regardless of quiet hours. Everything else respects the user's settings. Store preferences in a fast cache (Redis) with a database backing store.

Rate limiting per user: Cap how many notifications a user can receive per hour from a single category, and tune the cap per category. This catches runaway loops (the classic "your friend liked your photo" spam).
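
A minimal sketch of the preference check and the per-category rate limit, under a few simplifying assumptions: preferences are a plain dataclass rather than a Redis lookup, quiet hours suppress a notification outright (a real system might defer it until quiet hours end), and the channel-level "critical only" setting is left out. All class and function names are hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from datetime import time
from typing import Dict, Set

ALL_CHANNELS = frozenset({"push", "email", "sms"})


@dataclass
class Preferences:
    enabled: bool = True
    quiet_start: time = time(22, 0)  # 10pm in the user's local time
    quiet_end: time = time(8, 0)     # 8am
    categories: Dict[str, Set[str]] = field(default_factory=dict)


def in_quiet_hours(prefs: Preferences, local_now: time) -> bool:
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= local_now < prefs.quiet_end
    # The window wraps past midnight, e.g. 22:00 to 08:00.
    return local_now >= prefs.quiet_start or local_now < prefs.quiet_end


def channels_for(prefs: Preferences, category: str, local_now: time) -> Set[str]:
    if category == "security":
        return set(ALL_CHANNELS)  # security bypasses preferences entirely
    if not prefs.enabled or in_quiet_hours(prefs, local_now):
        return set()
    return set(prefs.categories.get(category, set()))


class HourlyRateLimiter:
    """Sliding-window cap per (user, category)."""

    def __init__(self, limit: int, window_seconds: float = 3600):
        self._limit = limit
        self._window = window_seconds
        self._events = defaultdict(deque)  # (user, category) -> send timestamps

    def allow(self, user_id: str, category: str, now: float) -> bool:
        sent = self._events[(user_id, category)]
        while sent and sent[0] <= now - self._window:
            sent.popleft()
        if len(sent) >= self._limit:
            return False
        sent.append(now)
        return True
```

The midnight wrap in `in_quiet_hours` is the part that is easy to get wrong: a 10pm to 8am window has a start later than its end, so a simple range check would never match.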

multi-channel orchestration

When a notification goes to multiple channels, coordinate them:

Send push first (instant delivery, cheapest)
If push fails or user doesn't open within 5 minutes, send email
If email bounces or no open within 1 hour, send SMS (most expensive, highest open rate)

This is called channel fallback or cascade delivery. It maximizes reach while minimizing cost and annoyance. Slack does something similar — they only email you about messages if you haven't been active in the app.
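
The cascade can be sketched as an ordered list of steps with an open-timeout on each. In this version the two callbacks (`send` and `opened_within`) are hypothetical hooks standing in for the channel workers and the delivery-tracking store, and the function runs synchronously. A real orchestrator would schedule each follow-up as a delayed job instead of waiting in place.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    channel: str
    wait_for_open_seconds: Optional[int]  # None marks the final step


# Order and timeouts taken from the list above.
CASCADE = [
    Step("push", 5 * 60),
    Step("email", 60 * 60),
    Step("sms", None),
]


def run_cascade(
    send: Callable[[str], bool],
    opened_within: Callable[[str, int], bool],
) -> List[str]:
    """Try channels in order; stop once one is delivered and opened.

    Returns the channels that were attempted, in order.
    """
    attempted = []
    for step in CASCADE:
        attempted.append(step.channel)
        if not send(step.channel):
            continue  # delivery failed or bounced: fall through to the next channel
        if step.wait_for_open_seconds is None:
            break  # last resort, nothing left to fall back to
        if opened_within(step.channel, step.wait_for_open_seconds):
            break
    return attempted
```

For example, if push is delivered but never opened and the email is opened, `run_cascade` returns `["push", "email"]` and the SMS is never sent.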

scaling considerations

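Push notifications: FCM (Firebase Cloud Messaging) delivers to Android devices, and APNs (Apple Push Notification service) delivers to iOS. Neither is a true bulk API for per-device sends: APNs takes one request per device token, and FCM's current HTTP v1 API takes one message per request (its older multi-message batch endpoint is deprecated). Throughput comes from multiplexing many requests over persistent HTTP/2 connections, which your push worker keeps open to both services.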

Email at scale: Use Amazon SES or SendGrid. Warm up IP addresses gradually to avoid spam filters. Maintain separate IP pools for transactional and marketing email. Monitor bounce rates closely: mailbox providers throttle or block senders with high bounce rates, and 5% is a common danger threshold.

SMS: Most expensive channel (roughly $0.01 per message in the US, often more elsewhere). Use it sparingly. Twilio or AWS SNS for delivery. Always include opt-out instructions (legally required in many countries).

Horizontal scaling: Each channel worker scales independently. During Black Friday, you might need 10x email workers but only 2x push workers. Kafka consumer groups make this straightforward: add more workers to the consumer group, and Kafka rebalances partitions automatically. Parallelism is capped by the topic's partition count, so provision enough partitions for peak load up front.

what I'd do differently at different scales

10k notifications/day: Skip Kafka. Use a simple job queue (BullMQ + Redis). One worker per channel. Monolith is fine.

10M notifications/day: Add Kafka. Separate workers. Priority queues. Basic retry logic. Template engine.

1B notifications/day: Everything above plus multi-region deployment, channel fallback orchestration, ML-driven send-time optimization (send when users are most likely to engage), and a dedicated deliverability team monitoring ISP relationships.

The system grows with the business. Don't build the billion-scale system on day one.