Report #35061
[architecture] Thundering herd problem when a failed service recovers
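Use exponential backoff with full jitter: sleep = random(0, min(cap, base * 2^attempt)). Avoid plain exponential backoff (base * 2^attempt with no randomness), which keeps retries synchronized across clients. For high-throughput clients, use decorrelated jitter: sleep = min(cap, random(base, previous_sleep * 3)), where previous_sleep starts at base. Decorrelated jitter derives each delay from the previous delay rather than from the attempt count.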
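The two formulas can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the default base of 0.1 s, and the cap of 30 s are assumptions chosen for the example, and the exponent clamp is an added guard against float overflow on very large attempt counts. The decorrelated variant uses the form random(base, previous_sleep * 3), which depends on the previous delay rather than the attempt number.

```python
import random


def full_jitter(attempt, base=0.1, cap=30.0):
    """Full jitter: sleep = random(0, min(cap, base * 2^attempt)).

    `attempt` is zero-based. The exponent is clamped (an assumption of this
    sketch) so huge attempt counts cannot overflow a float.
    """
    ceiling = min(cap, base * 2 ** min(attempt, 32))
    return random.uniform(0.0, ceiling)


def decorrelated_jitter(previous_sleep, base=0.1, cap=30.0):
    """Decorrelated jitter: sleep = min(cap, random(base, previous_sleep * 3)).

    Pass `base` as `previous_sleep` on the first retry, then feed each
    returned delay back in on the next call.
    """
    return min(cap, random.uniform(base, previous_sleep * 3))
```

With full jitter the caller tracks the attempt number; with decorrelated jitter the caller tracks only the last delay it slept for.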
Journey Context:
When a server fails, every client backs off on the same schedule and then retries at the same moment once it returns, overloading it immediately. Exponential backoff without jitter keeps those clients in lockstep, so retries arrive in periodic spikes. Full jitter desynchronizes clients the most but can produce long-tail delays for individual requests. Decorrelated jitter, a variant described in the AWS Architecture Blog post "Exponential Backoff and Jitter", is a middle ground. The cap bounds the delay so it cannot grow without limit (typically 20-60 s). Common mistakes: retrying HTTP 500 errors immediately, or using linear backoff, which does not reduce load fast enough.
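A retry loop that applies capped full jitter might look like the sketch below. Everything named here is hypothetical: `RetryableError` stands in for whatever failures the client treats as transient (for example HTTP 5xx responses or timeouts), `call_with_retries` is an illustrative helper, and the defaults are placeholders. The `sleep` parameter is injectable only so the loop can be exercised without real waiting.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for transient failures worth retrying."""


def call_with_retries(operation, max_attempts=5, base=0.1, cap=30.0,
                      sleep=time.sleep):
    """Call `operation` until it succeeds or the attempt budget runs out.

    Between attempts, wait a full-jitter delay:
    random(0, min(cap, base * 2^attempt)). The first retry is never
    immediate by design; it always goes through the jittered delay.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the last failure
            delay = random.uniform(0.0, min(cap, base * 2 ** attempt))
            sleep(delay)
```

Non-retryable exceptions propagate on the first failure, so only errors the client has explicitly classified as transient add load to a recovering service.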
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:19:46.776665+00:00— report_created — created