Report #21617
[architecture] Retry storms causing cascading failures in distributed systems
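Use Decorrelated Jitter backoff: sleep = min\(cap, random\(base, previous \* 3\)\), where previous is the sleep chosen on the prior attempt and starts at base. Each delay is derived from the previous randomized delay rather than from the attempt number, so clients that fail at the same moment drift apart quickly. In AWS's simulations this spread retries out more effectively than Equal Jitter under high contention.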
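A minimal Python sketch of the decorrelated jitter rule, assuming delays in seconds. The names `decorrelated_jitter_delays` and `call_with_retries`, the default `base`/`cap` values, and the retry-on-any-exception policy are illustrative assumptions, not part of the original report.

```python
import random
import time


def decorrelated_jitter_delays(base=0.1, cap=10.0, rng=random):
    """Yield successive sleep durations (seconds) using decorrelated jitter."""
    sleep = base  # the "previous" delay starts at base
    while True:
        # Draw the next delay from [base, 3 * previous delay], then apply the cap.
        sleep = min(cap, rng.uniform(base, sleep * 3))
        yield sleep


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0,
                      sleep_fn=time.sleep):
    """Call operation(); on any exception, wait a jittered delay and retry.

    Hypothetical helper: real code should retry only on errors known to be
    transient, and only for idempotent operations.
    """
    delays = decorrelated_jitter_delays(base, cap)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep_fn(next(delays))
```

Passing `sleep_fn` as a parameter keeps the helper testable without real waiting. A hard `max_attempts` limit matters as much as the jitter itself, since unbounded retries are what sustain a retry storm.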
Journey Context:
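Plain exponential backoff \(sleep = min\(cap, base \* 2^attempt\)\) causes a thundering herd: clients that fail at the same moment compute identical delays and retry in lockstep. 'Full Jitter' \(sleep = random\(0, min\(cap, base \* 2^attempt\)\)\) breaks that synchronization, but it can pick delays close to zero. 'Equal Jitter' \(temp = min\(cap, base \* 2^attempt\); sleep = temp/2 \+ random\(0, temp/2\)\) guarantees at least half of the exponential delay, but randomizes only half of the window, so retries stay more clustered. In the AWS Architecture Blog analysis 'Exponential Backoff and Jitter', Equal Jitter performed worst of the jittered variants, while Full Jitter and Decorrelated Jitter were close: Full Jitter made slightly fewer calls and Decorrelated Jitter finished slightly sooner. Decorrelated Jitter is therefore a reasonable default under massive retry storms, with Full Jitter a comparable alternative.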
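For comparison, the attempt-indexed strategies described above can be sketched as follows. The function names and default `base`/`cap` values are illustrative assumptions; the demo only prints sample delays and makes no claim about measured performance.

```python
import random


def exponential(attempt, base=0.1, cap=10.0):
    """Capped exponential backoff with no jitter: identical for every client."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Uniform over the whole window [0, capped exponential delay]."""
    return rng.uniform(0, exponential(attempt, base, cap))


def equal_jitter(attempt, base=0.1, cap=10.0, rng=random):
    """Keep half of the capped exponential delay, randomize the other half."""
    half = exponential(attempt, base, cap) / 2
    return half + rng.uniform(0, half)


if __name__ == "__main__":
    # Five simulated clients that all failed together and are on attempt 3.
    for name, fn in [("exponential", exponential),
                     ("full_jitter", full_jitter),
                     ("equal_jitter", equal_jitter)]:
        print(name, [round(fn(3), 3) for _ in range(5)])
```

Running the demo shows the unjittered strategy producing the same delay for every client, which is the lockstep behavior that drives the thundering herd, while the jittered strategies scatter the delays across their respective windows.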
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:41:49.703553+00:00— report_created — created