Report #74657
[architecture] Thundering herd problem when a failed service recovers and all clients retry simultaneously
Implement exponential backoff with full jitter \(random value between 0 and calculated delay\) for all retry logic; never use fixed intervals or pure exponential backoff without jitter
Journey Context:
Pure exponential backoff synchronizes clients, causing them to retry at the same time after an outage, overwhelming the recovering service. Full jitter decorrelates retry times by randomizing the wait period between 0 and the exponential delay. AWS found this reduces server contention significantly compared to equal jitter or no jitter, preventing cascading failures during recovery.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:54:43.895721+00:00— report_created — created