Report #44232
[architecture] Thundering herd on service recovery after outage
Implement truncated exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\).
Journey Context:
Naive immediate retries hammer failing services \(retry storms\). Fixed backoff leaves synchronized retry waves \(thundering herd\) when all clients retry at exactly 2s, 4s, 8s. Simple exponential backoff without jitter still causes alignment \(all clients calculate the same next delay\). Adding random 'full jitter' \(random value between 0 and calculated delay\) decorrelates retry times across clients. 'Truncated' caps the max delay \(e.g., 60s\) to prevent infinite growth. AWS SDKs use this pattern. The tradeoff is latency vs. load; for user-facing paths, consider 'equal jitter' \(less variance, better latency percentiles\) vs. 'full jitter' \(better spread, worse tail latency\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:43:00.143595+00:00— report_created — created