System Design: Distributed Job Scheduler
Every non-trivial application runs background jobs. Send a reminder email in 3 hours. Generate reports every Monday at 6 AM. Retry failed payment charges daily. Process uploaded videos in a queue. A single crontab on one server works until it doesn't — and then you need a distributed job scheduler.
Airflow, Temporal, and Celery exist because this problem is harder than it looks. Let's design a scheduler that handles millions of jobs per day, guarantees exactly-once execution, and recovers gracefully from failures.
requirements
Functional:
- Schedule jobs for future execution (one-time and recurring)
- Priority-based execution (critical jobs before batch jobs)
- Retry failed jobs with configurable backoff
- Dead letter queue for permanently failed jobs
- Job dependencies (run B after A completes)
- Dashboard for monitoring job status and history
Non-functional:
- Execute jobs within 1 second of their scheduled time
- Exactly-once execution (never run the same job twice)
- Handle 100k job executions per hour
- Survive node failures without losing scheduled jobs
- Horizontal scalability for workers
high-level architecture
┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│   Clients    │────▶│   Scheduler   │────▶│  Job Store   │
│  (API calls, │     │      API      │     │  (Postgres)  │
│  cron defs)  │     │               │     └──────┬───────┘
└──────────────┘     └───────────────┘            │
                                                  │
                     ┌───────────────┐     ┌──────▼───────┐
                     │    Ticker     │────▶│   Priority   │
                     │  (polls for   │     │    Queue     │
                     │   due jobs)   │     │   (Redis)    │
                     └───────────────┘     └──────┬───────┘
                                                  │
                          ┌───────────────────────┼───────────────────┐
                          │                       │                   │
                    ┌─────▼────┐            ┌─────▼────┐        ┌─────▼────┐
                    │ Worker 1 │            │ Worker 2 │        │ Worker 3 │
                    │ (execute │            │ (execute │        │ (execute │
                    │  jobs)   │            │  jobs)   │        │  jobs)   │
                    └──────────┘            └─────┬────┘        └──────────┘
                                                  │
                                         ┌────────▼───────┐
                                         │  Dead Letter   │
                                         │ Queue (failed  │
                                         │ after retries) │
                                         └────────────────┘
Four core components: the Scheduler API (accepts job definitions), the Ticker (finds jobs that are due), the Priority Queue (orders jobs for execution), and the Workers (actually run the jobs).
job data model
Every job has a complete specification that tells the system what to run, when, and what to do if it fails.
Job Record:
┌────────────────────────────────────────────┐
│ job_id:        "job_abc123"                │
│ type:          "send_email"                │
│ payload:       {to: "user@example.com",    │
│                 template: "reminder"}      │
│ priority:      HIGH (1-5 scale)            │
│ scheduled_at:  2026-04-03T10:00:00Z        │
│ status:        PENDING                     │
│ attempts:      0                           │
│ max_retries:   3                           │
│ retry_backoff: exponential (30s, 2m, 15m)  │
│ timeout:       300s                        │
│ created_by:    "order-service"             │
│ cron_expr:     NULL (one-time)             │
│ depends_on:    ["job_xyz789"]              │
└────────────────────────────────────────────┘
Job States:
PENDING → QUEUED → RUNNING → SUCCEEDED
                      │
                      ├──▶ FAILED → RETRY → QUEUED (loop)
                      │
                      └──▶ FAILED → DEAD (max retries exceeded)
Recurring jobs: Store the cron expression. When a recurring job completes, the Ticker calculates the next execution time and creates a new PENDING record. The previous execution stays in history.
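A minimal sketch of that rescheduling step, assuming the croniter library and a psycopg2 connection; the table shape matches the job record above, and the function name is illustrative:

```python
from datetime import datetime, timezone
from croniter import croniter
from psycopg2.extras import Json

def schedule_next_run(conn, job: dict) -> None:
    """After a recurring job completes, create the next PENDING record."""
    if not job["cron_expr"]:
        return  # one-time job; nothing to reschedule
    next_run = croniter(job["cron_expr"], datetime.now(timezone.utc)).get_next(datetime)
    with conn.cursor() as cur:
        cur.execute(
            """INSERT INTO jobs (type, payload, priority, cron_expr,
                                 scheduled_at, status)
               VALUES (%s, %s, %s, %s, %s, 'PENDING')""",
            (job["type"], Json(job["payload"]), job["priority"],
             job["cron_expr"], next_run),
        )
    conn.commit()
```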
the Ticker: finding due jobs
The Ticker is a process that periodically scans for jobs whose scheduled_at is in the past and status is PENDING. This is the most critical component — if the Ticker misses a job, it doesn't run.
Ticker Flow (every 1 second):

SELECT * FROM jobs
WHERE scheduled_at <= NOW()
  AND status = 'PENDING'
  AND (depends_on IS NULL
       OR NOT EXISTS (              -- "all dependencies SUCCEEDED",
            SELECT 1 FROM jobs dep  -- assuming depends_on is an array of job_ids
            WHERE dep.job_id = ANY (jobs.depends_on)
              AND dep.status <> 'SUCCEEDED'))
ORDER BY priority ASC, scheduled_at ASC
LIMIT 1000
FOR UPDATE SKIP LOCKED;

For each job:
1. Set status = QUEUED
2. Push to Redis priority queue
3. Commit
FOR UPDATE SKIP LOCKED is the key. Multiple Ticker instances can run simultaneously (for availability), and each one locks different rows. No duplicate queueing. This is the Postgres-native way to implement competing consumers.
Why not just poll Redis directly? Because jobs need durable storage. If Redis goes down, you lose the queue. Postgres is the source of truth. Redis is the fast dispatch layer.
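A sketch of the Ticker loop in Python, assuming psycopg2 and redis-py (the DSN and queue name are illustrative, and the dependency check is omitted for brevity):

```python
import time
import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=scheduler")  # illustrative DSN

DUE_JOBS = """
    SELECT job_id, priority, scheduled_at FROM jobs
    WHERE scheduled_at <= NOW() AND status = 'PENDING'
    ORDER BY priority ASC, scheduled_at ASC
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
"""

while True:
    with conn.cursor() as cur:
        cur.execute(DUE_JOBS)  # locks the rows this instance will queue
        for job_id, priority, scheduled_at in cur.fetchall():
            cur.execute("UPDATE jobs SET status = 'QUEUED' WHERE job_id = %s",
                        (job_id,))
            # priority plus a time fraction that stays below 1
            # (the scoring scheme is explained in the next section)
            score = priority + scheduled_at.timestamp() / 1e11
            r.zadd("job_queue", {job_id: score})
    conn.commit()  # releases the row locks
    time.sleep(1)
```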
priority queue
Jobs enter the queue with a priority score. Workers always pick the highest-priority job first.
Redis Priority Queue (Sorted Set):

Key:   job_queue
Score: priority + scheduled-time fraction

 Score │ Job ID
───────┼──────────────────────────────────────────
 1.001 │ job_critical_1  (priority 1, urgent)
 1.045 │ job_critical_2  (priority 1, less urgent)
 2.001 │ job_high_1      (priority 2)
 3.500 │ job_normal_1    (priority 3)
 5.999 │ job_batch_1     (priority 5, batch)

Worker: BZPOPMIN job_queue 30
→ Returns job_critical_1 (lowest score = highest priority)

The score is priority + (scheduled_timestamp / max_timestamp), where max_timestamp is a constant large enough that the time term always stays below 1. Priority always wins, with ties broken by scheduled time (FIFO within a priority level).
Multiple queues: For complete isolation, use separate Redis sorted sets per priority level. Workers check the critical queue first, then high, then normal. This prevents a flood of normal jobs from delaying critical ones.
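With redis-py this falls out of BZPOPMIN's multi-key form, which pops from the first non-empty sorted set in the order given (a sketch; the queue names are illustrative):

```python
import redis

r = redis.Redis()

# Blocks up to 30s; keys are checked strictly left to right, so a critical
# job is always taken before anything in the high or normal queues.
popped = r.bzpopmin(["queue:critical", "queue:high", "queue:normal"], timeout=30)
if popped:
    queue_name, job_id, score = popped
```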
exactly-once execution
The hardest guarantee in distributed systems. What if a worker crashes after starting a job but before reporting completion?
Exactly-Once Strategy:
1. Worker picks job from queue (BZPOPMIN — atomic remove)
2. Worker sets job status = RUNNING with a heartbeat TTL
3. Worker executes the job
4. On success: set status = SUCCEEDED
5. On failure: set status = FAILED, increment attempts
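The same five steps as a Python sketch (redis-py + psycopg2; the execute() dispatch and DSN are illustrative, and the periodic heartbeat refresh is elided here — it's covered next):

```python
import redis
import psycopg2

r = redis.Redis()
conn = psycopg2.connect("dbname=scheduler")  # illustrative DSN

def execute(job_id: str) -> None:
    ...  # look up the job record and run the handler for its type

while True:
    popped = r.bzpopmin("job_queue", timeout=30)  # step 1: atomic remove
    if popped is None:
        continue  # queue was empty; poll again
    job_id = popped[1].decode()
    r.set(f"heartbeat:{job_id}", 1, ex=60)        # step 2: heartbeat TTL
    with conn.cursor() as cur:
        cur.execute("UPDATE jobs SET status = 'RUNNING' WHERE job_id = %s",
                    (job_id,))
        conn.commit()
        try:
            execute(job_id)                        # step 3
            cur.execute("UPDATE jobs SET status = 'SUCCEEDED' "
                        "WHERE job_id = %s", (job_id,))      # step 4
        except Exception:
            cur.execute("UPDATE jobs SET status = 'FAILED', "
                        "attempts = attempts + 1 "
                        "WHERE job_id = %s", (job_id,))      # step 5
        conn.commit()
```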
The Heartbeat Problem:

┌──────────┐     ┌──────────┐
│  Worker  │────▶│ Job: abc │   heartbeat every 30s
│ (alive)  │     │ TTL: 60s │   resets TTL
└──────────┘     └──────────┘

Worker crashes → heartbeat stops → TTL expires
                                      │
┌──────────┐                    ┌─────▼─────┐
│  Reaper  │──── detects ──────▶│ Job: abc  │
│ Process  │     stale job      │ status:   │
│          │     (TTL expired)  │ RUNNING   │
└──────────┘                    │ → QUEUED  │
                                └───────────┘
Worker heartbeat: While executing, the worker sends a heartbeat every 30 seconds, resetting a 60-second TTL in Redis. If the worker crashes, the heartbeat stops, the TTL expires, and a Reaper process detects the orphaned job and re-queues it.
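A Reaper sketch under the same assumptions (redis-py, psycopg2, and the heartbeat:<job_id> key set by the worker above — all names illustrative):

```python
import time
import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=scheduler")  # illustrative DSN

while True:
    with conn.cursor() as cur:
        cur.execute("SELECT job_id, priority, scheduled_at FROM jobs "
                    "WHERE status = 'RUNNING'")
        for job_id, priority, scheduled_at in cur.fetchall():
            if r.exists(f"heartbeat:{job_id}"):
                continue  # worker is alive; it keeps refreshing the TTL
            # heartbeat key expired: the worker died mid-job, so re-queue
            score = priority + scheduled_at.timestamp() / 1e11
            r.zadd("job_queue", {job_id: score})
            cur.execute("UPDATE jobs SET status = 'QUEUED' "
                        "WHERE job_id = %s", (job_id,))
    conn.commit()
    time.sleep(30)
```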
Idempotent jobs: Even with exactly-once dispatch, the job itself might run twice (worker crashes after execution but before status update). The job's business logic must be idempotent. Sending an email? Check if it was already sent. Processing a payment? Use an idempotency key.
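One way to get an idempotent email handler, sketched with a Redis SET NX flag keyed on the job (the key scheme and send_email call are hypothetical):

```python
import redis

r = redis.Redis()

def send_email_once(job_id: str, to: str, template: str) -> None:
    # SET NX succeeds only for the first caller; a duplicate dispatch of
    # the same job sees the flag and skips the side effect. (Trade-off:
    # a crash between SET and send loses the email, so the flag could
    # instead be set after sending if at-least-once is preferred.)
    if not r.set(f"done:send_email:{job_id}", 1, nx=True, ex=86400):
        return  # already sent, or another worker is sending it
    send_email(to, template)  # hypothetical mail call
```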
retry and dead letter queues
When a job fails, the retry strategy determines what happens next.
Retry Timeline (exponential backoff with jitter):

Attempt 1: immediate   → FAILED
Attempt 2: after 30s   → FAILED
Attempt 3: after 2min  → FAILED
Attempt 4: after 15min → FAILED (max_retries = 3, exceeded)
                            │
                            ▼
                    Dead Letter Queue
                    ┌──────────────┐
                    │ job_abc123   │
                    │ failed: 4x   │
                    │ last_error:  │
                    │  "timeout"   │
                    │ payload: ... │
                    └──────────────┘
                            │
                            ▼
                 Alert to ops dashboard
             Manual retry or investigation
Jitter is critical. Without it, 1,000 failed jobs all retry at exactly the 30-second mark, creating a spike. Add random jitter: retry_delay = base_delay * 2^attempt + random(0, base_delay).
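That formula as a small function (a sketch; the base_delay and cap defaults are illustrative):

```python
import random

def retry_delay(attempt: int, base_delay: float = 30.0,
                cap: float = 3600.0) -> float:
    """Exponential backoff with jitter: base * 2^attempt, plus up to one
    base_delay of random spread so failed jobs don't retry in lockstep."""
    delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
    return min(delay, cap)  # never wait longer than the cap
```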
Dead letter queue (DLQ): Jobs that exhaust all retries land here. The DLQ is a separate table (or queue) that operations teams monitor. Common actions: manual retry after fixing the root cause, discard (for transient issues that resolved), or escalate (for systemic problems).
Circuit breaker: If 50% of jobs of the same type fail in the last 5 minutes, stop executing that type. Something is systemically wrong — maybe a downstream service is down. Keep accepting new jobs but don't dispatch them until the circuit closes.
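A minimal per-job-type breaker over a rolling window, sketched in-memory (in production the counts would live in Redis so every dispatcher shares state; thresholds are illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW = 300       # look at the last 5 minutes
THRESHOLD = 0.5    # open the circuit at a 50% failure rate
MIN_SAMPLES = 20   # don't trip on a handful of jobs

outcomes: dict[str, deque] = defaultdict(deque)  # job_type -> (timestamp, ok)

def record(job_type: str, ok: bool) -> None:
    outcomes[job_type].append((time.time(), ok))

def circuit_open(job_type: str) -> bool:
    window = outcomes[job_type]
    cutoff = time.time() - WINDOW
    while window and window[0][0] < cutoff:
        window.popleft()  # drop results older than the window
    if len(window) < MIN_SAMPLES:
        return False
    failures = sum(1 for _, ok in window if not ok)
    return failures / len(window) >= THRESHOLD
```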
job dependencies (DAG execution)
Some jobs must run in order. "Generate report" depends on "aggregate data," which depends on "extract raw data." This forms a directed acyclic graph (DAG).
DAG Example:
┌───────────┐     ┌───────────┐     ┌───────────┐
│  Extract  │────▶│ Transform │────▶│   Load    │
│   Data    │     │   Data    │     │   Data    │
└───────────┘     └─────┬─────┘     └───────────┘
                        │
                  ┌─────▼─────┐
                  │ Generate  │
                  │  Report   │
                  └───────────┘
Execution:
1. Extract Data runs (no dependencies)
2. When Extract completes → Transform becomes eligible
3. When Transform completes → Load and Generate Report become eligible
4. Load runs independently of Generate Report
Implementation: Each job record has a depends_on array of job IDs. The Ticker only queues a job when all dependencies have status = SUCCEEDED. If a dependency fails and exhausts retries, all downstream jobs are marked CANCELLED.
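The failure cascade is a transitive walk of the DAG. A sketch, assuming depends_on is a Postgres array column (&& is the array-overlap operator; names are illustrative):

```python
def cancel_downstream(conn, failed_job_id: str) -> None:
    """When a job goes DEAD, cancel everything that depends on it,
    transitively, so workers never pick up un-runnable jobs."""
    with conn.cursor() as cur:
        frontier = [failed_job_id]
        while frontier:
            cur.execute(
                """UPDATE jobs SET status = 'CANCELLED'
                   WHERE status = 'PENDING' AND depends_on && %s
                   RETURNING job_id""",
                (frontier,),
            )
            # jobs cancelled this round may themselves have dependents
            frontier = [row[0] for row in cur.fetchall()]
    conn.commit()
```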
This is essentially how Airflow schedules tasks within a DAG. Temporal takes a different approach with workflow-as-code, where the control flow lives in ordinary program code instead of a declarative DAG definition.
monitoring and observability
A job scheduler without monitoring is a time bomb. You need to know:
Dashboard Metrics:
┌────────────────────────────────────────────┐
│ Jobs Executed (last 24h):  142,847         │
│ Success Rate:              99.2%           │
│ Avg Execution Time:        4.3s            │
│ P99 Execution Time:        47s             │
│ Jobs in Queue:             234             │
│ DLQ Depth:                 12 ⚠️            │
│ Overdue Jobs (>1min late): 3 ⚠️             │
│                                            │
│ By Type:                                   │
│   ├── send_email:       success 99.8%      │
│   ├── process_video:    success 94.1% ⚠️    │
│   ├── generate_report:  success 99.9%      │
│   └── sync_data:        success 98.7%      │
└────────────────────────────────────────────┘
Alerts to set up:
- DLQ depth > 0 (investigate immediately)
- Job type success rate < 95% (circuit breaker territory)
- Overdue jobs > 10 (Ticker or worker capacity issue)
- Queue depth growing for > 5 minutes (workers can't keep up)
- Worker heartbeat failures (infrastructure problem)
key tradeoffs
Pull vs. push to workers: Pull (workers ask for jobs) is simpler and handles backpressure naturally — slow workers just pull less often. Push (scheduler assigns jobs to workers) enables better load balancing but requires the scheduler to know worker capacity. Pull wins for simplicity.
Postgres vs. dedicated queue: You can build the entire system on Postgres (using SKIP LOCKED as the queue mechanism). This works up to ~10k jobs/hour. Beyond that, the polling overhead becomes significant, and a dedicated queue (Redis, RabbitMQ, SQS) performs better. Start with Postgres, add Redis when you see Ticker scan times increasing.
Exactly-once vs. at-least-once: True exactly-once is nearly impossible in distributed systems. What we really implement is "at-least-once delivery with idempotent handlers." The system might dispatch a job twice (worker crash, network partition), but the job's business logic handles duplicates gracefully. Design for this.
Centralized vs. decentralized scheduling: A single Ticker is a single point of failure. SKIP LOCKED already lets multiple Ticker instances poll concurrently; the alternative is leader election (via Postgres advisory locks or Redis SETNX), where only the leader polls for due jobs and, if it dies, another instance takes over within seconds.
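Leader election with a Postgres advisory lock takes a few lines (a sketch; the lock key 42 is arbitrary and tick() stands in for one Ticker scan as sketched earlier):

```python
import time
import psycopg2

conn = psycopg2.connect("dbname=scheduler")  # illustrative DSN
conn.autocommit = True  # keep the session-level lock outside transactions

with conn.cursor() as cur:
    # Session-level advisory lock: held until this connection dies, so a
    # crashed leader releases it automatically and a standby takes over.
    while True:
        cur.execute("SELECT pg_try_advisory_lock(42)")
        if cur.fetchone()[0]:
            break          # we won the election
        time.sleep(5)      # standby: retry until the leader's session drops
    while True:
        tick()             # hypothetical: one scan for due jobs
        time.sleep(1)
```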
The best job schedulers are boring. They run millions of jobs without anyone thinking about them. The design goal isn't cleverness — it's reliability.