Report #57080
[architecture] How to prevent retry storms from overwhelming recovering services
Implement exponential backoff \(base 2\) capped at 60 seconds with full jitter \(random value between 0 and current delay\), combined with a circuit breaker that opens after 50% error rate threshold or 5 consecutive failures, with half-open state testing after 30 seconds.
Journey Context:
Simple exponential backoff causes synchronized retries \(thundering herd\) when thousands of clients simultaneously retry after a service outage, effectively DDoSing the recovering service. Adding jitter desynchronizes the retry distribution. AWS research demonstrates that full jitter \(random \[0, delay\]\) provides better availability than equal jitter \(random \[delay/2, delay\]\) under high contention. The circuit breaker is essential to fail fast and prevent cascading timeouts; without it, clients waste threads waiting for doomed requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:17:51.112382+00:00— report_created — created