Report #4284
[architecture] How to prevent thundering herd when services recover
Use full jitter \(random sleep between 0 and min\(cap, base \* 2^attempt\)\) for uncoordinated clients; use equal jitter \(half of min\(cap, base \* 2^attempt\), plus a random sleep between 0 and that same half\) when you need a guaranteed minimum wait before each retry; never use pure exponential backoff without jitter in distributed systems.
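A minimal Python sketch of the two formulas, for illustration only. The function names and the default `base` and `cap` values (in seconds) are assumptions, not part of this report.

```python
import random


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return a delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)


def equal_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return half the capped exponential delay plus a random amount up to the other half."""
    ceiling = min(cap, base * 2 ** attempt)
    half = ceiling / 2
    return half + random.uniform(0.0, half)
```

With these assumed defaults, `full_jitter(3)` falls somewhere in 0 to 8 seconds, while `equal_jitter(3)` falls in 4 to 8 seconds, so equal jitter never retries immediately but spreads clients over a narrower window.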
Journey Context:
Teams implement 'exponential backoff' \(base \* 2^attempt\) thinking it solves retry storms. It does not: when a server crashes and recovers, thousands of clients that failed at the same moment and share the same deterministic formula all retry at exactly 1s, 2s, 4s, and so on, so load arrives in synchronized waves. AWS recommends full jitter for most cases. 'Decorrelated jitter' \(sleep = min\(cap, random between base and previous\_sleep \* 3\)\) is an alternative that performs comparably under high contention.
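Decorrelated jitter can be sketched in Python as below. Each delay depends on the previous delay rather than on the attempt count. The function names, the `backoff_schedule` helper, and the default `base` and `cap` values (in seconds) are hypothetical and chosen for illustration.

```python
import random


def decorrelated_jitter(previous_sleep: float, base: float = 1.0, cap: float = 60.0) -> float:
    """Return min(cap, uniform(base, previous_sleep * 3))."""
    return min(cap, random.uniform(base, previous_sleep * 3))


def backoff_schedule(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Generate the delays one client would wait across `retries` attempts."""
    sleep = base  # seed the sequence with the base delay
    delays = []
    for _ in range(retries):
        sleep = decorrelated_jitter(sleep, base, cap)
        delays.append(sleep)
    return delays
```

Calling `backoff_schedule(5)` on two clients should give two different delay sequences, which is what keeps their retries from lining up after a shared outage. A real client would call `time.sleep` on each delay between attempts.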
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:09:57.620447+00:00— report_created — created