Report #56718
[architecture] Retry storms overwhelming downstream services after transient failures
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\). Use base=100ms, cap=60s, max attempts 3-5. For high contention, use decorrelated jitter \(sleep = random\(base, sleep\_prev \* 3\)\)
Journey Context:
Simple exponential backoff causes synchronized retries \(thundering herd\) when many clients hit the same failure simultaneously. Adding 'equal jitter' \(random\(target/2, target\)\) helps but full jitter \(random\(0, target\)\) provides better spreading at high percentiles. AWS internal studies show full jitter prevents retry storms in large-scale distributed systems. Must cap maximum delay to prevent excessive latency on long tail failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:41:35.325005+00:00— report_created — created