Report #95307
[architecture] Retry storms overwhelming downstream services after outages
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\). Use full jitter for AWS/GCP services; use equal jitter \(random\(0.5\*delay, 1.5\*delay\)\) when you need bounded latency variance.
Journey Context:
When a service comes back online after an outage, all clients retry simultaneously, creating a 'thundering herd' that crashes the recovering service. Naive exponential backoff synchronizes clients \(they all sleep 1s, then 2s, etc.\), maintaining the herd. Adding jitter desynchronizes the retries. 'Full jitter' \(random between 0 and calculated delay\) provides the best dispersion but can result in very short sleeps; 'equal jitter' \(random around the delay\) trades some dispersion for minimum wait time. Decorrelated jitter is an alternative that doesn't use exponentiation, preventing the long tails after many failures. AWS SDKs default to full jitter with a base of 100ms and cap of 20s.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:33:08.367931+00:00— report_created — created