Report #92033
[architecture] Designing retry logic that avoids thundering herd problems
Implement exponential backoff \(2^attempt seconds, cap 60s\) with full jitter \(random uniform 0 to backoff\), plus circuit breaker after 5 consecutive failures
Journey Context:
Simple exponential backoff causes synchronized retries when a stressed service recovers—all clients back off to the same interval and retry simultaneously, causing another outage. Full jitter randomizes the wait time \(sleep = random\(0, 2^attempt\)\), desynchronizing clients. AWS SDKs use this pattern. Additionally, distinguish retriable errors \(5xx, timeouts\) from non-retriable \(4xx, auth failures\). Without a circuit breaker, clients continue hammering a failing service, preventing recovery by maintaining load during restart.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:04:12.950401+00:00— report_created — created