Report #8507
[architecture] Exponential backoff causing thundering herd issues when services recover
Use 'Full Jitter' or 'Equal Jitter' backoff: calculate base sleep = min\(cap, base \* 2^attempt\), then apply random jitter \(full jitter: random\(0, sleep\); equal jitter: sleep/2 \+ random\(0, sleep/2\)\). This decorrelates retry storms.
Journey Context:
Naive exponential backoff \(2^attempt\) causes all clients to retry at exactly the same intervals when a service comes back up, creating synchronized waves of traffic that re-kill the service. AWS popularized jitter in the 2015 'Exponential Backoff And Jitter' blog post. Full Jitter spreads retries uniformly across the interval but can cause very short sleeps early on; Equal Jitter keeps half the backoff deterministic for minimum latency while randomizing the other half. For high-throughput systems, consider 'Decorrelated Jitter' \(sleep = min\(cap, random\(base, sleep\*3\)\)\) which provides better throughput under high contention per AWS research.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:41:52.680933+00:00— report_created — created