Report #13258
[architecture] How to prevent retry storms when downstream service returns 503
Use exponential backoff with full jitter (sleep = random(0, min(cap, base * 2^attempt))), limit retries to 3-5 attempts, and wrap calls in a circuit breaker that opens after 5 consecutive failures, so callers fail fast rather than hammering the downstream service.
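A minimal Python sketch of the retry half of this advice, assuming the caller can signal a retryable failure (such as a 503) by raising an exception. The names `RetryableError`, `full_jitter_delay`, and `call_with_retries`, as well as the default `base`, `cap`, and `max_retries` values, are hypothetical and chosen only for illustration.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for a transient failure such as an HTTP 503."""


def full_jitter_delay(attempt, base=0.1, cap=10.0):
    """Full jitter: sleep = random(0, min(cap, base * 2^attempt)), in seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_retries=4, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(); on RetryableError, back off with full jitter and retry.

    Makes at most 1 + max_retries calls, then re-raises the last error.
    The sleep parameter is injectable so the delay can be stubbed in tests.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            sleep(full_jitter_delay(attempt, base, cap))
```

Only errors that are known to be transient should be mapped to `RetryableError`; retrying non-idempotent or permanently failing requests would add load without any chance of success.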
Journey Context:
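Fixed, linear, or un-jittered exponential backoff causes thundering herds: clients that failed together compute the same delays and retry together, so a recovering service is hit by synchronized traffic spikes that can take it down again. 'Full jitter', popularized by the AWS Architecture Blog, randomizes the sleep across the whole interval from 0 to the backoff ceiling, which desynchronizes client retries. 'Equal jitter' (sleep = ceiling/2 + random(0, ceiling/2), where ceiling = min(cap, base * 2^attempt)) keeps half of the backoff as a guaranteed minimum wait, so no retry fires almost immediately, at the cost of spreading retries over a narrower window. However, retries without a circuit breaker are dangerous: if the downstream is degraded, every request can turn into up to 1 + retry_count calls, multiplying load by that factor. A circuit breaker transitions from Closed (normal) to Open (failing fast) once a failure threshold is reached, giving the downstream room to recover; after a cool-down it typically lets a trial request through (Half-Open) and closes again if that succeeds. Together, jittered backoff and a circuit breaker handle transient failures without cascading outages.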
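The circuit breaker described above could be sketched roughly as follows. This is a simplified, single-threaded illustration, not a production implementation: the class name `CircuitBreaker`, the exception `CircuitOpenError`, and the `reset_timeout` cool-down of 30 seconds are assumptions made for the example, and a concurrent version would need locking and a limit on how many trial calls run at once.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is Open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay Open (assumed value)
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means Closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of sending more load downstream.
                raise CircuitOpenError("circuit open; downstream is cooling down")
            # Cool-down elapsed: fall through and treat this call as a trial (Half-Open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if a trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success: reset the failure count and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

One way to combine the two techniques is to put the breaker inside the retried operation, so that each attempt is counted by the breaker and an open circuit stops the retry loop early, provided `CircuitOpenError` is not treated as retryable.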
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:16:36.381300+00:00— report_created — created