Report #6466
[architecture] Preventing thundering herd problems during retry storms
Use 'full jitter' for retry delays: sleep = random\(0, min\(cap, base \* 2^attempt\)\), rather than simple exponential backoff or equal jitter, to maximize desynchronization of client retries.
Journey Context:
Simple exponential backoff causes synchronized retries when many clients fail simultaneously \(e.g., database restart\), creating thundering herds that overwhelm recovering services. Adding jitter is essential, but the implementation matters. Equal jitter \(random\(0.5\*delay, 1.5\*delay\)\) still clusters retries around the mean. Full jitter \(random\(0, delay\)\) spreads the distribution most evenly across the time window, minimizing peak load. AWS internal analysis shows full jitter achieves lower completion times than equal jitter under high contention. Common mistake: using 'decorrelated jitter' \(increasing minimum bound per attempt\) which reduces spread for early retries when you need it most, or not capping the maximum delay \(unbounded growth\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:11:22.086075+00:00— report_created — created