Report #24931
[architecture] Cron jobs drifting, missing executions, overlapping runs, and single points of failure in distributed systems
Replace cron with a distributed task queue \(for example SQS, RabbitMQ, or Celery\) that provides at-least-once delivery. Schedule work by publishing messages with a delivery delay or visibility timeout rather than polling on a timer.
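A minimal sketch of the "publish with a delay" idea, using a toy in-memory queue and a simulated clock instead of a real broker. The names `DelayQueue`, `run_worker`, and the `job`/`due` message fields are hypothetical and chosen for illustration; a real deployment would use the broker's own delayed-delivery feature and its limits.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    visible_at: float
    seq: int
    body: dict = field(compare=False)


class DelayQueue:
    """Toy in-memory stand-in for a broker with delayed delivery (assumption)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def publish(self, body, delay_seconds, now):
        # The message stays hidden until now + delay_seconds.
        heapq.heappush(
            self._heap, _Entry(now + delay_seconds, next(self._seq), body)
        )

    def receive(self, now):
        """Return the next message whose delay has elapsed, or None."""
        if self._heap and self._heap[0].visible_at <= now:
            return heapq.heappop(self._heap).body
        return None


def run_worker(queue, handler, interval, start, end, step=1.0):
    """Consume due messages on a simulated clock and publish the next run."""
    now = start
    while now <= end:
        msg = queue.receive(now)
        if msg is None:
            now += step
            continue
        handler(msg, now)
        # Schedule from the intended time, not the finish time, so that
        # handler duration does not accumulate as drift.
        next_due = msg["due"] + interval
        queue.publish(
            {"job": msg["job"], "due": next_due},
            max(0.0, next_due - now),
            now,
        )


if __name__ == "__main__":
    q = DelayQueue()
    q.publish({"job": "report", "due": 10.0}, delay_seconds=10.0, now=0.0)
    run_worker(q, lambda m, t: print(m["job"], "ran at", t), 10.0, 0.0, 35.0)
```

Because any worker that can read the queue can pick up the next message, no single machine owns the schedule. That is the property plain cron lacks.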
Journey Context:
Traditional cron assumes a single server and has no built-in failover: if the machine dies, jobs do not run. It also suffers from 'drift', where job duration affects start times, and a run that outlasts its interval can overlap the next one. Distributed cron \(Chronos, Kubernetes CronJobs\) helps but adds operational complexity. The queue approach treats scheduled work as just another message with a delay or visibility timeout, which naturally handles fan-out, retry, and horizontal scaling. Tradeoff: queues have at-least-once semantics, so handlers must be idempotent; cron remains simpler for single-node 'garbage collection' style tasks that don't need distribution.
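The idempotency requirement can be sketched as a handler that remembers which scheduled occurrences it has already completed, so a redelivered message becomes a no-op. `IdempotentHandler` and the `job`/`due` key are hypothetical names, and the in-memory set is an assumption for illustration. With several workers it would need to be a shared durable store, for example a database table with a unique constraint.

```python
class IdempotentHandler:
    """Run an action at most once per (job, due) key, even if redelivered."""

    def __init__(self, action):
        self._action = action
        # Assumption: in-memory only. Replace with a durable, shared store
        # when more than one worker consumes the queue.
        self._done = set()

    def handle(self, message):
        key = (message["job"], message["due"])
        if key in self._done:
            return False  # duplicate delivery, skip the side effect
        self._action(message)
        # Record completion only after the action succeeds, so a crash
        # mid-action leads to a retry rather than a lost run.
        self._done.add(key)
        return True


if __name__ == "__main__":
    sent = []
    handler = IdempotentHandler(sent.append)
    msg = {"job": "nightly-report", "due": "2026-06-18T00:00:00Z"}
    handler.handle(msg)
    handler.handle(dict(msg))  # simulated redelivery of the same occurrence
    print(len(sent))
```

Marking the key after the action means a crash between the two steps can still repeat the action, so the action itself should tolerate being repeated or should record completion in the same transaction as its side effect.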
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:15:31.624886+00:00— report_created — created