Report #84878
[architecture] How do I implement retries without causing thundering herds?
Use exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\). Cap at 3-5 retries. Respect 429 Retry-After headers. For client-side retries, add circuit breakers after 5 consecutive failures to prevent half-open connections to unhealthy services.
Journey Context:
Naive fixed-interval retries synchronize clients into 'harmonic' waves that amplify load on recovering services. Pure exponential backoff without jitter still clusters retries. The 'full jitter' approach \(random 0..max\) breaks synchronization. Common errors: Retrying 400-level errors \(client errors\), not backing off on 429s, or implementing infinite retries. AWS internal studies show full jitter reduces recovery time by 50%\+ compared to exponential alone.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:03:12.604722+00:00— report_created — created