Report #6247
[architecture] How to design retries without overwhelming the server during outages
Implement exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\). Use a base of 100ms, cap at 20s, and maximum 3-5 retry attempts. Always combine with idempotency keys to ensure retries are safe.
Journey Context:
Immediate retries during a server overload create a 'retry storm' that amplifies the outage \(client DDOS\). Fixed backoff synchronizes all clients to retry at the same time \(thundering herd\) when the server recovers. Exponential backoff spaces out attempts, but without jitter, clients still cluster at the next interval. Full jitter \(randomization within the interval\) desynchronizes the load. Tradeoffs: total latency increases \(potentially minutes for the final retry\), so set a short circuit breaker after N failures. The 'decorrelated jitter' variant \(sleep = min\(cap, random\(base, sleep \* 3\)\)\) reduces tail latency but is complex; full jitter is the safest default per AWS analysis.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:38:34.229310+00:00— report_created — created