System Design: Distributed Rate Limiter
Rate limiting is one of those things that sounds trivial until you need it to work across 50 servers, handle 100k requests per second, and never incorrectly block a paying customer.
Every API you've used has one. Stripe's is 100 requests/second. GitHub's is 5,000/hour. Twitter's is... well, Twitter changes theirs weekly. Let's design one that can handle all of these patterns.
requirements
Functional:
- Limit requests per client (by API key, IP, or user ID)
- Support multiple rate limit rules (e.g., 100/min AND 1000/hour per key)
- Return rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset)
- Different limits per tier (free vs. paid vs. enterprise)
Non-functional:
- Sub-millisecond decision latency (this sits in the hot path of every request)
- Distributed — works across multiple app servers
- Accurate — no significant over-counting or under-counting
- Graceful degradation — if the rate limiter fails, let traffic through (fail open)
algorithms compared
There are four mainstream algorithms. Each has a clear use case.
Token Bucket: A bucket fills with tokens at a steady rate. Each request consumes one token. If the bucket is empty, reject. Allows bursts up to the bucket capacity.
Leaky Bucket: Requests enter a queue processed at a fixed rate. Smooths traffic but adds latency. Good for APIs that need steady throughput.
Fixed Window Counter: Count requests in fixed time windows (e.g., 12:00:00-12:01:00). Simple but has the boundary problem — a burst at 12:00:59 and 12:01:01 doubles the effective rate.
Sliding Window Log/Counter: Tracks timestamps of recent requests or uses a weighted average across windows. Most accurate but uses more memory.
 Token Bucket           Fixed Window            Sliding Window
┌─────────────┐         ┌─────────────┐         ┌─────────────┐
│ ●●●●●○○○○○  │         │ Window 1: 78│         │ ╔═══╗       │
│ capacity: 10│         │ Window 2: 23│         │ ║   ║ slides│
│ refill: 5/s │         │ limit: 100  │         │ ╚═══╝ ────▶ │
│ current: 5  │         └─────────────┘         │ weighted avg│
└─────────────┘                                 └─────────────┘
 Allows bursts          Boundary problem         Most accurate
For most API rate limiting, sliding window counter hits the sweet spot. It's accurate, memory-efficient, and Redis-friendly. Token bucket wins when you explicitly want to allow controlled bursts (like Stripe does).
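To make the burst behavior concrete, here is a minimal in-memory token bucket sketch in Python. It is illustrative only: a single-process toy, not the distributed Redis-backed design below, and the class and parameter names are made up for this example:
import time

class TokenBucket:
    # Toy single-process token bucket; a distributed version would keep this
    # state in Redis, the way the sliding window design below does.
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                 # request allowed, one token consumed
        return False                    # bucket empty, reject

bucket = TokenBucket(capacity=10, refill_rate=5)   # matches the diagram above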
high-level architecture
┌──────────┐     ┌───────────────┐     ┌──────────────┐
│  Client  │────▶│ Load Balancer │────▶│  API Server  │
└──────────┘     └───────────────┘     └──────┬───────┘
                                              │
                                     ┌────────▼────────┐
                                     │  Rate Limiter   │
                                     │   Middleware    │
                                     └────────┬────────┘
                                              │
                              ┌───────────────▼───────────────┐
                              │         Redis Cluster         │
                              │  ┌─────┐  ┌─────┐  ┌─────┐    │
                              │  │ R-1 │  │ R-2 │  │ R-3 │    │
                              │  └─────┘  └─────┘  └─────┘    │
                              └───────────────────────────────┘
                                              │
                                     ┌────────▼────────┐
                                     │  Rules Config   │
                                     │  (rate limits   │
                                     │   per tier)     │
                                     └─────────────────┘
The rate limiter runs as middleware in every API server. It checks Redis before the request hits business logic. If Redis is down, fail open — better to serve traffic without rate limiting than to return 500s.
the Redis implementation
This is where the design gets real. A sliding window counter in Redis uses sorted sets.
For each client + rule combination, maintain a sorted set where:
- Member = unique request ID (timestamp + random suffix)
- Score = timestamp in milliseconds
On each request:
- Remove all entries older than the window (ZREMRANGEBYSCORE)
- Count remaining entries (ZCARD)
- If count < limit, add the new entry (ZADD) and allow
- If count >= limit, reject with 429
This sequence must be atomic. If the count and the subsequent add aren't a single atomic operation, two concurrent requests could both see count = 99 (limit 100) and both get through. Use a Lua script:
-- Sliding window rate limit (Lua, runs atomically in Redis)
local key = KEYS[1]
local window = tonumber(ARGV[1]) -- window size in ms
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local request_id = ARGV[4]
-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
-- Count current entries
local count = redis.call('ZCARD', key)
if count < limit then
  -- Record this request and refresh the key's TTL
  redis.call('ZADD', key, now, request_id)
  redis.call('PEXPIRE', key, window)
  return {0, limit - count - 1} -- allowed, remaining
else
  return {1, 0} -- rejected, 0 remaining
end
Single Redis call, atomic execution, O(log N) for the sorted set operations. At 100k requests per second, this adds ~0.5ms per request.
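For context, here is a sketch of how application middleware might invoke this script with redis-py. The SLIDING_WINDOW_LUA variable (assumed to hold the script above), the key scheme, and the function name are illustrative, not a fixed API:
import time
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)
SLIDING_WINDOW_LUA = """ ... """  # paste the Lua script above here
# register_script loads the script once and calls it via EVALSHA afterwards
check_limit = r.register_script(SLIDING_WINDOW_LUA)

def is_allowed(client_id: str, window_ms: int, limit: int) -> tuple[bool, int]:
    now_ms = int(time.time() * 1000)
    request_id = f"{now_ms}-{uuid.uuid4().hex}"       # unique sorted-set member
    rejected, remaining = check_limit(
        keys=[f"ratelimit:{client_id}:{window_ms}"],  # one key per client + rule
        args=[window_ms, limit, now_ms, request_id],
    )
    return rejected == 0, int(remaining)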
multiple rate limit rules
Real APIs need layered limits. Stripe enforces both per-second and per-day limits. The middleware evaluates all applicable rules and takes the strictest result.
Rules Configuration Example:
┌────────────────────────────────────────┐
│ Free Tier                              │
│   ├── 10 requests/second               │
│   ├── 1,000 requests/hour              │
│   └── 10,000 requests/day              │
│                                        │
│ Pro Tier                               │
│   ├── 100 requests/second              │
│   ├── 50,000 requests/hour             │
│   └── 1,000,000 requests/day           │
│                                        │
│ Enterprise Tier                        │
│   └── Custom (negotiated per contract) │
└────────────────────────────────────────┘
Store rules in a config service (database-backed, cached locally). Each request looks up the client's tier, fetches applicable rules, and runs the Redis check for each. Pipeline the Redis calls — 3 rules = 1 round trip with MULTI/EXEC or Lua.
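A sketch of the strictest-result evaluation, reusing the is_allowed helper from the earlier sketch. The tier table is hypothetical, and a production version would collapse the loop into one pipelined call or a single Lua script as described above:
# Hypothetical per-tier rules as (window_ms, limit) pairs; in practice these
# come from the config service and are cached locally.
RULES = {
    "free": [(1_000, 10), (3_600_000, 1_000), (86_400_000, 10_000)],
    "pro":  [(1_000, 100), (3_600_000, 50_000), (86_400_000, 1_000_000)],
}

def check_all_rules(client_id: str, tier: str) -> tuple[bool, int]:
    # Every rule must pass; report the lowest remaining budget across rules.
    allowed = True
    min_remaining = float("inf")
    for window_ms, limit in RULES[tier]:
        ok, remaining = is_allowed(client_id, window_ms, limit)
        allowed = allowed and ok
        min_remaining = min(min_remaining, remaining)
    return allowed, int(min_remaining)
One wrinkle: as written, a request rejected by the hourly rule has already been recorded against the per-second rule. Some implementations check all rules first and only record on success; others accept the small distortion.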
distributed challenges
Clock skew. If your API servers' clocks differ by 100ms, two servers will disagree on which window a request falls into. Solution: use Redis's clock — fetch the now value from the Redis TIME command and pass it into the script, instead of using the app server's clock. Alternatively, use NTP and accept small inaccuracies; at the scale of rate limiting, 100ms of skew is negligible.
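One way to source now from the Redis server, assuming the redis-py client from the earlier sketch (the helper name is illustrative):
import redis

def redis_now_ms(r: redis.Redis) -> int:
    # TIME returns (unix seconds, microseconds) from the Redis server's clock,
    # so every app server computes windows against the same clock.
    seconds, microseconds = r.time()
    return seconds * 1000 + microseconds // 1000
This costs an extra round trip per check; depending on your Redis version and script replication mode, you may instead be able to call TIME from inside the Lua script itself.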
Redis cluster partitioning. If a client's rate limit key lands on a Redis node that becomes unreachable, you lose the count. Options:
- Fail open (recommended for most cases)
- Use local in-memory counters as a fallback (less accurate but better than nothing; see the sketch after this list)
- Redis Sentinel for automatic failover
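A minimal sketch of that in-memory fallback: a plain fixed-window counter per process, so it only sees this server's share of the traffic. The class name and structure are illustrative:
import time
from collections import defaultdict

class LocalFallbackLimiter:
    # Per-process fixed-window counter, used only while Redis is unreachable.
    def __init__(self, limit: int, window_s: int):
        self.limit = limit
        self.window_s = window_s
        self.counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, client_id: str) -> bool:
        window = int(time.time()) // self.window_s   # current window index
        self.counts[(client_id, window)] += 1
        # Old windows should be evicted periodically in a real implementation.
        return self.counts[(client_id, window)] <= self.limit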
Race conditions across data centers. If you have servers in US-East and EU-West, a globally distributed rate limiter needs either a global Redis (added latency) or local Redis with periodic sync (allows brief over-limit). Most companies choose local Redis per region with slightly relaxed limits. Cloudflare's approach: each edge location tracks locally, syncs counts to a central store every few seconds.
response headers
Every well-designed rate limiter returns these headers:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706918400
Retry-After: 23
This isn't optional — it's how clients build well-behaved retry logic. The Retry-After header prevents thundering herd problems where all rate-limited clients retry simultaneously.
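A sketch of how middleware might map limiter state onto these headers. Here reset_epoch would come from the oldest timestamp still in the window (sliding window) or the window boundary (fixed window); the function name is illustrative:
import time

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int, allowed: bool) -> dict[str, str]:
    # Returned on every response, not just 429s, so clients can self-throttle.
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if not allowed:
        # Tell the client exactly how long to back off, avoiding a retry stampede.
        headers["Retry-After"] = str(max(0, reset_epoch - int(time.time())))
    return headers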
tradeoffs and decisions
Fail open vs. fail closed: If Redis dies, do you block all traffic (safe but catastrophic for availability) or allow all traffic (available but unprotected)? Almost every production system fails open. The rate limiter protects against abuse, not existential threats. If Redis is down for 30 seconds, you'll survive the extra traffic.
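In code, failing open is just a narrow exception handler around the limiter check, sketched here against the check_all_rules helper from earlier:
import redis

def check_with_fail_open(client_id: str, tier: str) -> tuple[bool, int]:
    try:
        return check_all_rules(client_id, tier)
    except (redis.ConnectionError, redis.TimeoutError):
        # Redis is down or slow: let the request through rather than 500 it.
        # (Remaining budget is unknown during the outage; 0 is a placeholder.)
        return True, 0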
Accuracy vs. performance: The sliding window counter is ~99.7% accurate according to Cloudflare's analysis. The 0.3% comes from the approximation in the weighted average. For rate limiting, that's more than good enough. Don't use a sliding window log (which stores every timestamp) unless you need exact counts — it uses 10x more memory.
Client-side rate limiting: Smart clients should implement local rate limiting before hitting the server. This reduces wasted network round trips. Return the rate limit state in headers so clients can self-throttle.
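A sketch of a well-behaved client that respects those headers, using the requests library; the URL handling and single-retry policy are simplified for illustration:
import time
import requests

def get_with_self_throttle(url: str, api_key: str) -> requests.Response:
    resp = requests.get(url, headers={"Authorization": f"Bearer {api_key}"})
    if resp.status_code == 429:
        # Back off exactly as long as the server asked, then retry once.
        time.sleep(int(resp.headers.get("Retry-After", "1")))
        resp = requests.get(url, headers={"Authorization": f"Bearer {api_key}"})
    elif resp.headers.get("X-RateLimit-Remaining") == "0":
        # Budget exhausted: wait for the window to reset before the next call.
        reset = int(resp.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(0, reset - time.time()))
    return resp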
what companies actually use
- Stripe: Token bucket, per-key, implemented in their API gateway. Different buckets for different endpoint groups.
- GitHub: Fixed window with hourly reset. Simple, transparent, well-documented.
- Cloudflare: Sliding window at the edge. Handles billions of requests. Their blog posts on the implementation are excellent reading.
- AWS API Gateway: Token bucket with configurable burst and steady-state rates.
The pattern: simpler algorithms in simpler systems. Stripe and AWS can afford token bucket complexity because it's core to their billing model. GitHub uses fixed windows because it's good enough and easy to explain.