Report #81674
[architecture] Retry storms causing thundering herd outages when a downstream service recovers
Implement exponential backoff with full jitter: wait = random(0, min(cap, base * 2^attempt)). Cap the maximum backoff (e.g., 60s) and limit the total number of retry attempts to prevent infinite loops.
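A minimal Python sketch of the formula above, assuming delays in seconds. The names `full_jitter_delay` and `call_with_retries`, the defaults (`base=1.0`, `cap=60.0`, `max_attempts=5`), and the injectable `sleep` parameter are illustrative choices, not part of the report.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=60.0):
    """Return a delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is used up.

    Assumption: any exception is treated as retryable. Real code should
    catch only the errors that are known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(full_jitter_delay(attempt, base, cap))
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. The bounded `max_attempts` is what prevents the infinite loop mentioned above.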
Journey Context:
When a failed service comes back up and every client retries at the same fixed interval (say, every 1s), the retries arrive simultaneously and can knock the recovering service over again (thundering herd). Exponential backoff (1s, 2s, 4s, ...) spreads the load over time, but clients that failed together still retry together: they all wait 4s and then hit at almost the same instant. Adding 'full jitter', a random delay between 0 and the current backoff ceiling, spreads those retries across the whole window and breaks up the synchronized spikes. Simulations published on the AWS Architecture Blog ("Exponential Backoff and Jitter") found that jittered backoff substantially reduces both the total number of calls and the time needed for all clients to finish, compared with unjittered exponential backoff. Always cap the backoff (to prevent hours of waiting) and add a circuit breaker to stop retrying entirely if the error rate exceeds a threshold.
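The circuit breaker mentioned above could look roughly like the following Python sketch. The class name `CircuitBreaker`, its parameters (`failure_threshold`, `window`, `cooldown`), and the simplified reset after the cooldown are all assumptions made for illustration. Production breakers usually add a half-open state that lets a single probe request through first.

```python
import time


class CircuitBreaker:
    """Stop sending requests once the recent failure rate passes a threshold."""

    def __init__(self, failure_threshold=0.5, window=20, cooldown=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # fraction of failures that trips the breaker
        self.window = window                        # number of recent outcomes considered
        self.cooldown = cooldown                    # seconds to stay open before trying again
        self.clock = clock
        self.results = []                           # recent outcomes, True means failure
        self.opened_at = None                       # None means the breaker is closed

    def allow_request(self):
        """Return True if a request (or retry) may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the breaker and start with a fresh window.
            self.opened_at = None
            self.results.clear()
            return True
        return False

    def record(self, failed):
        """Record one outcome and open the breaker if the failure rate is too high."""
        self.results.append(bool(failed))
        self.results = self.results[-self.window:]
        if len(self.results) == self.window:
            rate = sum(self.results) / self.window
            if rate > self.failure_threshold:
                self.opened_at = self.clock()
```

A caller would check `allow_request()` before each attempt and call `record(...)` after it, so that backoff governs when to retry and the breaker governs whether to retry at all.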
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:41:11.991756+00:00— report_created — created