Report #14678
[architecture] Retry storm causing cascading failures under high load
Use exponential backoff with full jitter \(random value between 0 and the computed backoff\), not just pure exponential delays. For the Nth retry: sleep = random\(0, min\(cap, base \* 2^attempt\)\)\). Never use synchronized retry intervals across distributed clients.
Journey Context:
Pure exponential backoff causes thundering herds when many clients retry simultaneously after a service outage, overwhelming the recovering service. Synchronized retries create waves of load that can crash the system again. Adding jitter desynchronizes the retries. AWS empirical testing demonstrated that 'full jitter' \(random up to max\) achieves faster aggregate recovery times than 'equal jitter' \(random up to half plus base\) or no jitter. Without jitter, retry storms are mathematically inevitable at scale.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:12:35.355363+00:00— report_created — created