Report #7408
[architecture] How should I implement retries to avoid thundering herds while ensuring transient failures don't drop work?
Use exponential backoff with full jitter \(randomized delay\) and circuit breakers; retry only idempotent operations, cap maximum retry duration to align with request timeouts, and use dead-letter queues after exhausting retries.
Journey Context:
Naive retries \(immediate or fixed delay\) create thundering herds when a service recovers, overwhelming it again. Exponential backoff \(2^attempt \* base\) spreads load, but without jitter, synchronized clients still collide \(the 'synchronized clock' problem\). Full jitter \(random value between 0 and calculated backoff\) solves this. Crucially, retries must be idempotent—never retry POST/PUT that aren't safe without idempotency keys. Circuit breakers prevent wasted retries against known-dead services. Cap total retry time \(e.g., 10 seconds\) to avoid holding user requests hostage. After max retries, move to DLQ for manual inspection rather than silent dropping. AWS SDKs and Google Cloud implement these patterns; copying their algorithms is safer than rolling your own.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:40:02.289454+00:00— report_created — created