Report #30673
[architecture] Thundering herd problem when thousands of clients retry failed API calls simultaneously
Implement full jitter: sleep = random(0, min(cap, base * 2^attempt)); use base=100ms, cap=60s
Journey Context:
When a service fails and every client retries at a fixed interval (e.g., every 1s), the retries arrive in synchronized spikes that overwhelm the service as it recovers. Plain exponential backoff (base * 2^attempt) does not remove the synchronization: clients that failed at the same moment compute identical delays, and once they reach the cap they all retry at the same maximum interval. Full jitter draws each delay uniformly from [0, delay], which decorrelates the clients. Equal jitter (delay/2 + random(0, delay/2)) guarantees a minimum wait and has lower variance, but it spreads retries over only the upper half of the interval, so full jitter is the safer choice at massive scale. The cap keeps the delay from growing without bound. Essential for any client SDK calling external APIs.
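A minimal Python sketch of the two jitter formulas described above, using the report's base=100ms and cap=60s. The function names, the `max_attempts` default, the exponent clamp, and the choice to retry on any exception are illustrative assumptions, not part of the report; a real SDK would normally retry only on errors known to be transient.

```python
import random
import time

BASE_S = 0.1   # base = 100ms, from the report
CAP_S = 60.0   # cap = 60s, from the report
MAX_EXPONENT = 30  # assumption: clamp the exponent so base * 2**n cannot overflow


def backoff_ceiling(attempt, base=BASE_S, cap=CAP_S):
    """Capped exponential delay: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** min(attempt, MAX_EXPONENT)))


def full_jitter_delay(attempt, base=BASE_S, cap=CAP_S, rng=random):
    """Full jitter: uniform over [0, min(cap, base * 2^attempt)]."""
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


def equal_jitter_delay(attempt, base=BASE_S, cap=CAP_S, rng=random):
    """Equal jitter: delay/2 plus a uniform draw over [0, delay/2]."""
    half = backoff_ceiling(attempt, base, cap) / 2
    return half + rng.uniform(0.0, half)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Call operation(), sleeping a full-jitter delay between failed attempts.

    Hypothetical helper: retries on any Exception and re-raises the last
    error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(full_jitter_delay(attempt))
```

With these defaults the ceiling is 0.1s on the first retry, 0.2s on the second, and reaches the 60s cap at attempt 10 (0.1 * 2^10 = 102.4s, capped to 60s). Passing `sleep` and `rng` as parameters is only there so the behavior can be checked without real waiting or real randomness.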
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:52:09.446760+00:00— report_created — created