Report #58373
[architecture] Services crashing on restart due to thundering herd of retries
Implement exponential backoff with full jitter: sleep = random(0, min(cap, base * 2^attempt)); add a circuit breaker that stops retries after N consecutive failures to prevent cascading overload.
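A minimal Python sketch of the fix above, under stated assumptions: the names `full_jitter_delay`, `CircuitBreaker`, `CircuitOpenError`, and `call_with_retries` are hypothetical, the defaults (`base=0.1`, `cap=10.0`, `max_failures=5`) are illustrative rather than recommended values, and the breaker is the simplest possible consecutive-failure counter with no half-open or timed-reset state.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no further retries are attempted."""


def full_jitter_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: sleep = random(0, min(cap, base * 2^attempt))."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


class CircuitBreaker:
    """Counts consecutive failures; open once the count reaches max_failures."""

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.max_failures

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1


def call_with_retries(op, breaker, base=0.1, cap=10.0, sleep=time.sleep):
    """Call op() until it succeeds or the breaker opens.

    Assumption: every exception is treated as retryable. A real client
    would likely catch only timeouts, connection errors, and similar.
    """
    attempt = 0
    while True:
        if breaker.is_open:
            raise CircuitOpenError("breaker open; not calling the service")
        try:
            result = op()
        except Exception as exc:
            breaker.record_failure()
            if breaker.is_open:
                # Stop here instead of sleeping and retrying again.
                raise CircuitOpenError("too many consecutive failures") from exc
            sleep(full_jitter_delay(attempt, base, cap))
            attempt += 1
        else:
            breaker.record_success()
            return result
```

Sharing one `CircuitBreaker` instance across calls to the same downstream service means that once it opens, later callers fail fast instead of adding load. The `sleep` parameter is injectable only so the loop can be exercised without real delays.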
Journey Context:
Developers often implement fixed backoff or plain exponential backoff without jitter. After an outage, every client then retries on the same schedule, so the retries arrive in synchronized waves (the thundering herd) and crush the recovering service. Full jitter picks each sleep uniformly at random within the exponential window, which desynchronizes the clients. The tradeoff is higher tail latency for some individual requests in exchange for system stability. Simulations published on the AWS Architecture Blog ("Exponential Backoff and Jitter") found that jittered backoff substantially reduces both client work and completion time compared with unjittered exponential backoff, with full jitter among the best-performing variants.
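The desynchronization effect can be illustrated with a small Python simulation. This is a sketch, not a measurement of any real service: `retry_times` and `peak_load` are hypothetical helpers, and the client count, `base`, `cap`, and bucket width are arbitrary illustrative values.

```python
import random
from collections import Counter


def retry_times(n_clients, attempt, base=0.1, cap=10.0, jitter=True, seed=0):
    """Return the moment (seconds after failure) each client retries."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    window = min(cap, base * 2 ** attempt)
    if jitter:
        # Full jitter: each client picks uniformly within the window.
        return [rng.uniform(0, window) for _ in range(n_clients)]
    # No jitter: every client waits exactly the full window.
    return [window] * n_clients


def peak_load(times, bucket=0.05):
    """Largest number of retries landing in any single time bucket."""
    buckets = Counter(int(t / bucket) for t in times)
    return max(buckets.values())


no_jitter = retry_times(1000, attempt=4, jitter=False)
jittered = retry_times(1000, attempt=4, jitter=True)

print(peak_load(no_jitter))  # 1000: all clients hit the service at once
print(peak_load(jittered))   # far lower: retries spread over the 1.6 s window
```

The same spread is the source of the tail-latency tradeoff: with full jitter some clients draw a sleep near the top of the window while others retry almost immediately.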
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:28:07.964819+00:00— report_created — created