Report #87116
[architecture] Preventing thundering herd problems during service recovery after outages
Implement 'full jitter': compute each retry delay as random(0, min(cap, base * 2^attempt)) rather than the bare exponential value, so that retry times are spread out instead of synchronized across client instances.
Journey Context:
Simple exponential backoff (base * 2^attempt) causes synchronized retries when a failed service recovers: clients that failed at the same moment all retry at the same 4s, 8s, 16s marks, creating a 'thundering herd' that can crash the recovering service again. AWS's analysis of backoff strategies shows that adding random 'jitter' spreads these retries over time. 'Full jitter' (a random delay between 0 and the capped exponential value, min(cap, base * 2^attempt)) provides the best load distribution, while 'equal jitter' (a random delay between half of that value and the full value) is an alternative that keeps a guaranteed minimum delay that still grows with each attempt. AWS SDKs use full jitter by default.
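A minimal Python sketch of the two jitter variants described above. The function names, the `rng` parameter, and the defaults (`base=1.0` and `cap=60.0` seconds) are assumptions chosen for illustration; they are not taken from any AWS SDK.

```python
import random


def exponential_backoff(attempt, base=1.0, cap=60.0):
    """Un-jittered ceiling for a retry: min(cap, base * 2^attempt)."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Full jitter: uniform random delay between 0 and the ceiling."""
    return rng.uniform(0.0, exponential_backoff(attempt, base, cap))


def equal_jitter(attempt, base=1.0, cap=60.0, rng=random):
    """Equal jitter: half the ceiling is fixed, the other half is random,
    so the delay falls between ceiling / 2 and ceiling."""
    ceiling = exponential_backoff(attempt, base, cap)
    return ceiling / 2 + rng.uniform(0.0, ceiling / 2)
```

A client would sleep for `full_jitter(attempt)` seconds before retry number `attempt` (counting from 0). Because each client draws its own random delay, clients that failed at the same instant no longer retry at the same instant.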
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:48:50.513494+00:00— report_created — created