Report #6670
[architecture] Choosing retry backoff algorithms to prevent thundering herd on recovery
Never use pure exponential backoff \(base \* 2^attempt\) without jitter. Use 'Decorrelated Jitter': sleep = min\(cap, random\(base, sleep \* 3\)\), where sleep starts at base. For massive scale \(millions of clients\), use 'Full Jitter': sleep = random\(0, min\(cap, base \* 2^attempt\)\). Always combine with circuit breakers.
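A minimal Python sketch of the two recommended formulas, for illustration only. The function names, the default base and cap values, and the exponent clamp are assumptions added here, not part of the report.

```python
import random


def full_jitter(attempt: int, base: float = 0.1, cap: float = 20.0) -> float:
    """Full Jitter: sleep = random(0, min(cap, base * 2^attempt))."""
    # Clamp the exponent so a very large attempt count cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 62)))
    return random.uniform(0.0, ceiling)


def decorrelated_jitter(prev_sleep: float, base: float = 0.1, cap: float = 20.0) -> float:
    """Decorrelated Jitter: sleep = min(cap, random(base, prev_sleep * 3)).

    The caller passes in the previous sleep; the first call uses prev_sleep = base.
    """
    return min(cap, random.uniform(base, prev_sleep * 3))


if __name__ == "__main__":
    # Print one possible delay sequence for each strategy.
    sleep = 0.1
    for attempt in range(6):
        sleep = decorrelated_jitter(sleep)
        print(f"attempt {attempt}: full={full_jitter(attempt):.3f}s decorrelated={sleep:.3f}s")
```

Both functions are stateless: decorrelated jitter takes the previous sleep as an argument, so each retry sequence can keep its own value.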
Journey Context:
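Pure exponential backoff synchronizes clients: if a server fails at time T, all clients retry at the same offsets from T \(after base, then 2 \* base, then 4 \* base, and so on\), so each wave of retries arrives together, causing thundering herds and recovery failure. 'Full Jitter' \(sleep = random\(0, min\(cap, base \* 2^attempt\)\)\) desynchronizes effectively, but the wide random range can leave an individual client sleeping close to the ceiling when a shorter wait would do \(poor tail latency\). 'Equal Jitter' \(temp = min\(cap, base \* 2^attempt\); sleep = temp/2 \+ random\(0, temp/2\)\) keeps half of the delay fixed and randomizes only the other half, which avoids very short sleeps but still allows correlation. AWS's 'Decorrelated Jitter' \(sleep = min\(cap, random\(base, sleep \* 3\)\)\) derives each delay from the previous delay instead of the attempt count; the min\(cap, ...\) clamp bounds the maximum sleep time while achieving better dispersion than equal jitter and faster recovery than full jitter. The tradeoff is slightly higher collision probability than full jitter, but better UX. Because decorrelated jitter carries state \(the previous sleep\), that state must be kept per retry sequence and not shared across callers, so the algorithm is re-entrant and safe for concurrent retries. Jitter is necessary but not sufficient; circuit breakers are required to stop trying during outages.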
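The last two points, per-call backoff state and pairing jitter with a circuit breaker, can be sketched as follows. This is an illustration under stated assumptions: CircuitBreaker, CircuitOpenError, call_with_retry, and all thresholds and defaults are hypothetical names and values, and the breaker is deliberately simplified \(after the cooldown it lets calls through again without limiting them to a single trial request\).

```python
import random
import threading
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after `threshold` consecutive
    failures and allows calls again once `reset_after` seconds have passed."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self._threshold = threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self._lock = threading.Lock()  # breaker state is shared across threads

    def allow(self):
        with self._lock:
            if self._opened_at is None:
                return True
            return self._clock() - self._opened_at >= self._reset_after

    def record_success(self):
        with self._lock:
            self._failures = 0
            self._opened_at = None

    def record_failure(self):
        with self._lock:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()


def call_with_retry(op, breaker, max_attempts=5, base=0.1, cap=20.0, sleep=time.sleep):
    """Retry `op` with decorrelated jitter, giving up early if the breaker opens."""
    delay = base  # local variable: every call owns its own backoff sequence
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; not retrying")
        try:
            result = op()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, random.uniform(base, delay * 3))
            sleep(delay)
        else:
            breaker.record_success()
            return result
```

Keeping delay as a local variable is what makes the retry loop re-entrant; only the breaker is shared, and it guards its counters with a lock.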
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:41:42.430212+00:00— report_created — created