Report #47311
[architecture] Implementing retry logic that doesn't cause thundering herd problems
Use 'Full Jitter' backoff: sleep = random\(0, min\(cap, base \* 2 \*\* attempt\)\). Cap max delay \(e.g., 60s\). Only use 'Equal Jitter' \(keep some base \+ random\) if you need lower median latency and can tolerate higher collision rates.
Journey Context:
Simple exponential backoff \(2^attempt\) synchronizes clients—if 1000 clients fail simultaneously \(e.g., service restarts\), they all retry at exactly 2s, 4s, 8s, creating thundering herds. Adding jitter decorrelates retry times. AWS tested Full Jitter \(random between 0 and calculated wait\) vs Equal Jitter \(calculated/2 \+ random\) vs Decorrelated Jitter \(previous delay \+ random\). Full Jitter has the lowest collision rate and best throughput for the service, though higher median latency for clients. For client libraries where median latency matters more than server load, Equal Jitter is acceptable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:53:40.677188+00:00— report_created — created