Report #85178
[architecture] Using cron jobs in distributed systems causes missed executions, overlapping runs, and single points of failure
Replace cron with a persistent job queue that supports delayed execution (e.g., Sidekiq, Celery, Faktory) to get automatic retries, horizontal scaling, and effectively-once execution, provided the jobs themselves are idempotent
Journey Context:
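Traditional cron works on a single server but breaks down in distributed environments: (1) It has no built-in mechanism to prevent overlapping runs when a job takes longer than its interval; across hosts this requires a distributed lock such as Redis Redlock, which is hard to get right. (2) It has no failover: if the cron server dies, jobs do not run until someone intervenes manually. (3) Installing the same crontab on several servers for redundancy creates a thundering herd, because every server fires at the same instant and contends for the same lock.

The queue-based alternative: enqueue each job with a 'perform_at' timestamp. The queue's scheduler polls for jobs whose time has arrived and hands them to workers. Benefits: (1) Effectively-once execution. Queues in this family generally offer at-least-once delivery rather than true exactly-once, so a job can run twice after a crash or retry; pairing that with idempotency keys makes the duplicate run harmless. (2) Automatic retries with exponential backoff. (3) Horizontal scaling by adding workers. (4) Less exposure to clock skew between servers, because due times are stored in the queue and checked in one place rather than by each server's local cron; the processes that poll should still keep their clocks synchronized.

Migration path: for one-off delayed work, call 'MyJob.perform_in(1.hour)'. Note that this schedules a single run, so it replaces an hourly '0 * * * *' cron entry only if the job re-enqueues itself when it finishes. For cron-like recurring schedules backed by the queue, use Sidekiq-Cron (Sidekiq) or Resque-Scheduler (Resque). Critical: ensure the queue is persistent (Redis with AOF enabled, or a database-backed queue) so that scheduled jobs are not lost on restart.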
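The queue mechanics described above can be sketched in a few lines of plain Ruby. This is an illustrative in-memory model, not the Sidekiq, Celery, or Faktory API: the `DelayedQueue` class, its `perform_at(run_at, key)` signature, and the `poll` method are hypothetical names chosen for this sketch, and the Array stands in for the persistent store a real queue would use. It shows three ideas from the report: jobs carry a due timestamp, a duplicate idempotency key is rejected, and a failed job is rescheduled with exponential backoff.

```ruby
# Illustrative in-memory model of a delayed job queue (hypothetical API).
# A real deployment would keep jobs in Redis (AOF enabled) or a database.
class DelayedQueue
  Job = Struct.new(:key, :run_at, :attempts, :work)

  attr_reader :completed

  def initialize(max_attempts: 3, base_delay: 2.0)
    @jobs = []        # pending jobs, each with a due time
    @seen = {}        # idempotency keys that were already accepted
    @completed = []   # keys of jobs that finished successfully
    @max_attempts = max_attempts
    @base_delay = base_delay
  end

  # Enqueue a block to run at `run_at` (seconds, as a Float).
  # Returns false if the idempotency key was used before.
  def perform_at(run_at, key, &work)
    return false if @seen.key?(key)

    @seen[key] = true
    @jobs << Job.new(key, run_at, 0, work)
    true
  end

  # Run every job that is due at `now`. A worker would call this in a loop.
  def poll(now)
    due, @jobs = @jobs.partition { |job| job.run_at <= now }
    due.sort_by(&:run_at).each { |job| run(job, now) }
  end

  def pending
    @jobs.size
  end

  private

  def run(job, now)
    job.attempts += 1
    job.work.call
    @completed << job.key
  rescue StandardError
    # Give up after max_attempts (a real queue would dead-letter the job).
    return if job.attempts >= @max_attempts

    # Exponential backoff: base_delay, then 2x, then 4x, and so on.
    job.run_at = now + @base_delay * (2**(job.attempts - 1))
    @jobs << job
  end
end

queue = DelayedQueue.new
queue.perform_at(3600.0, "hourly-report:01") { puts "building report" }
queue.perform_at(3600.0, "hourly-report:01") { puts "duplicate" } # => false
queue.poll(3599.0) # nothing is due yet
queue.poll(3600.0) # the first job runs once
```

Passing `now` into `poll` instead of reading the system clock inside the class is a deliberate choice for the sketch: it keeps the due-time comparison in one place and makes the behavior easy to test. A production queue additionally has to claim each due job atomically so that two workers cannot pick up the same one.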
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:33:18.812436+00:00— report_created — created