Report #64013
[architecture] Cron jobs for time-based tasks fail silently during downtime, create race conditions, and lack retry mechanisms
Replace cron with a persistent queue (SQS, RabbitMQ, or a Postgres table polled with `FOR UPDATE SKIP LOCKED`) that provides at-least-once delivery. Run a scheduler service that enqueues jobs only while the queue depth is below a threshold (backpressure), rather than strictly by clock time. Set the visibility timeout to 2x the maximum processing time, and move messages to a dead-letter queue after 3 retries. Make jobs idempotent by storing and checking idempotency keys.
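The visibility timeout, dead-letter, and idempotency rules above can be sketched with a toy in-memory queue. This is only an illustration under stated assumptions: `InMemoryQueue`, `Message`, and `process_once` are hypothetical names, not part of any SQS, RabbitMQ, or Postgres client, and "3 retries" is read here as three delivery attempts.

```python
import time
from dataclasses import dataclass


@dataclass
class Message:
    key: str                      # idempotency key (hypothetical field name)
    payload: dict
    receive_count: int = 0        # how many times the message was delivered
    invisible_until: float = 0.0  # hidden from other workers until this time


class InMemoryQueue:
    """Toy stand-in for a real broker; illustrates the semantics only."""

    def __init__(self, visibility_timeout, max_receives=3):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages = []
        self.dead_letters = []

    def send(self, key, payload):
        self.messages.append(Message(key, payload))

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg in list(self.messages):
            if msg.invisible_until > now:
                continue  # currently claimed by some worker
            if msg.receive_count >= self.max_receives:
                # Delivery attempts exhausted: park it as a poison pill.
                self.messages.remove(msg)
                self.dead_letters.append(msg)
                continue
            msg.receive_count += 1
            msg.invisible_until = now + self.visibility_timeout
            return msg
        return None

    def ack(self, msg):
        # Called only after successful processing; unacked messages reappear.
        self.messages.remove(msg)


def process_once(msg, processed_keys, side_effect):
    """Run side_effect at most once per idempotency key. Returns True if it ran."""
    if msg.key in processed_keys:
        return False  # duplicate delivery, safe to ack without redoing work
    side_effect(msg.payload)
    processed_keys.add(msg.key)
    return True


MAX_PROCESSING_SECONDS = 30.0  # assumed value for illustration
queue = InMemoryQueue(visibility_timeout=2 * MAX_PROCESSING_SECONDS)
```

In a real system the idempotency key would likely live in a durable store and be written in the same transaction as the side effect; the in-memory set here only shows the check-then-record shape.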
Journey Context:
Traditional cron executes at fixed times; if the server is down at that moment, the job never runs (no durability). If a job takes longer than its interval, overlapping instances create race conditions (e.g., duplicate billing). Cron also has no built-in retry, so a transient network failure means the run is simply lost. The queue-based approach treats time-based triggers as just another event source. The scheduler enqueues 'execute_job' messages with a 'not-before' timestamp, and the queue delays delivery until then (SQS `DelaySeconds`, which is capped at 15 minutes per message, or a RabbitMQ holding queue whose TTL-expired messages are routed through a dead-letter exchange). The critical insight is to check queue depth before enqueuing: if 1000 jobs are already backed up, adding more clock-triggered jobs only worsens the overload. Instead, the scheduler skips the enqueue or raises an alert, which amounts to load shedding. The visibility timeout handles worker crashes: if a worker dies before acknowledging a message, the message becomes visible to other workers once the timeout expires. Dead-letter queues capture poison pills after the maximum number of retries, preventing infinite redelivery loops. This architecture handles downtime gracefully (messages persist in the queue) and scales horizontally (add more workers), unlike cron, which runs on a single node.
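The backpressure check and the 'not-before' delay described above might look like the following minimal sketch. `DelayedQueue` and `schedule_tick` are hypothetical names invented for illustration, and the in-memory list stands in for a real broker's delayed delivery.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedQueue:
    """Toy queue of (not_before, job_name) pairs; not a real broker client."""
    items: list = field(default_factory=list)

    def depth(self):
        return len(self.items)

    def enqueue(self, job_name, not_before):
        self.items.append((not_before, job_name))

    def pop_ready(self, now):
        # Deliver the earliest job whose not-before time has passed, if any.
        ready = sorted(item for item in self.items if item[0] <= now)
        if not ready:
            return None
        self.items.remove(ready[0])
        return ready[0][1]


def schedule_tick(queue, job_name, not_before, max_depth, alert=print):
    """Enqueue one job unless the backlog has reached max_depth.

    Returns True if the job was enqueued, False if it was shed.
    """
    if queue.depth() >= max_depth:
        # Load shedding: skip this trigger and tell an operator instead.
        alert(f"backlog {queue.depth()} >= {max_depth}; skipping {job_name}")
        return False
    queue.enqueue(job_name, not_before)
    return True
```

A caller would invoke `schedule_tick` on each timer tick and have workers call `pop_ready(now)`. Whether a shed trigger should be dropped or deferred depends on the job, and that choice is left open here.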
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:55:50.154502+00:00— report_created — created