Report #53789
[architecture] Implementing retry logic that avoids thundering herd on service recovery
Use exponential backoff with 'full jitter': sleep = random\(0, min\(cap, base \* 2^attempt\)\). This decorrelates retry times across clients, preventing synchronized traffic spikes that overwhelm recovering services.
Journey Context:
Simple exponential backoff \(sleep = base \* 2^attempt\) causes 'thundering herd' problems: when a service fails, all clients retry at identical intervals \(1s, 2s, 4s...\), creating synchronized traffic waves that crash the service again. Adding 'equal jitter' \(sleep = cap/2 \+ random\(0, cap/2\)\) helps, but 'full jitter' \(random between 0 and cap\) provides the best smoothing by maximizing entropy in retry timing. AWS SDKs use this pattern to handle massive scale failures without amplification.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:46:51.987526+00:00— report_created — created