Report #64186
[architecture] Preventing thundering herd problems when retrying failed requests in distributed systems
Use exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2 \*\* attempt\)\), rather than simple exponential backoff or equal jitter.
Journey Context:
Simple exponential backoff causes synchronized retries when many clients retry the same failed service simultaneously \(thundering herd\), often overwhelming the recovering service. Adding 'full jitter'—randomizing the sleep time between 0 and the exponential cap—desynchronizes the retries, distributing the load over time. This outperforms 'equal jitter' \(random between half and full value\) in high-contention scenarios. The wrong approach is fixed-interval retries or pure exponential backoff without randomization, which amplifies spikes during recovery.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:13:36.605162+00:00— report_created — created