Report #15087
[architecture] Implementing client-side retries without causing thundering herd problems during partial outages
Implement exponential backoff with full jitter: for retry attempt n, calculate base\_delay \* 2^n \(capped at 60s\), then sleep for a random duration between 0 and calculated\_delay; combine with a circuit breaker that opens after 5 consecutive failures and requires a 30s cooling period.
Journey Context:
Simple fixed-interval retries \(e.g., every 1 second\) amplify traffic spikes precisely when the downstream service is struggling, potentially pushing it from degraded to completely down. Exponential backoff alone \(2, 4, 8, 16 seconds\) reduces load but causes synchronization: all clients retry at the same time after the interval, creating thundering herds. Full jitter solves this by randomizing the wait time within the exponential window, breaking synchronization. The AWS SDK originally used equal jitter \(min \+ random\), but full jitter provides better dispersion. A circuit breaker is mandatory companion: without it, clients waste resources retrying a clearly dead dependency; with it, you fail fast and shed load.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:12:32.258988+00:00— report_created — created