Report #10631
[architecture] Aggressive retries causing thundering herd or DDoSing your own services
Implement exponential backoff with full jitter (sleep = rand(0, min(cap, base * 2^attempt))) and circuit breakers. Do not retry 4xx errors (client faults), apart from those that signal a transient condition such as 408 and 429, and keep the total retry duration below the request timeout.
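The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `RetryableError`, `full_jitter_delay`, `call_with_retries`, and the default values for `base`, `cap`, `max_attempts`, and `budget` are all hypothetical names and numbers chosen for the example. It assumes the caller's `operation` raises `RetryableError` for 5xx responses and timeouts, raises something else for client faults, and enforces its own per-attempt timeout.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (5xx, timeouts)."""


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Return rand(0, min(cap, base * 2**attempt)) in seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(operation, max_attempts=5, budget=2.0, base=0.1, cap=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Retry `operation` on RetryableError within an attempt and time budget.

    `budget` (seconds) is assumed to be set below the upstream timeout.
    Any other exception (for example a client fault) propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + budget
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError:
            delay = full_jitter_delay(attempt, base, cap)
            # Stop on the last attempt, or when sleeping would overrun the budget.
            if attempt == max_attempts - 1 or clock() + delay >= deadline:
                raise
            sleep(delay)
```

Note that the budget check only guards the sleep; the time spent inside `operation` itself still has to be bounded by a per-attempt timeout for the total to stay under the upstream limit.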
Journey Context:
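Immediate or fixed-interval retries amplify traffic spikes during outages (thundering herd). Exponential backoff without jitter still produces lock-step retries (correlated collisions), because clients that failed at the same moment compute the same delays and return at the same moment. Full jitter decorrelates clients by drawing each sleep time at random from within the exponential window. Critical rules: do not retry 400-level errors (client mistakes), since resending the same bad request will fail the same way; retry only 500-level errors and timeouts, plus the few 4xx codes that signal a transient condition such as 408 and 429. Use a circuit breaker that opens after consecutive failures so callers fail fast instead of hammering a struggling dependency; its half-open state lets a limited number of probe requests through to test recovery before full traffic resumes, which keeps the system from flapping. Cap the retry count so that the worst-case total time (attempts plus sleeps) stays below the upstream gateway timeout; otherwise the gateway gives up first and the retries continue as orphan requests whose results nobody is waiting for.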
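The circuit breaker behaviour described above can be sketched as a small state machine. This is an illustrative example only: the class name `CircuitBreaker`, the `CircuitOpenError` exception, and the `failure_threshold` and `reset_timeout` defaults are assumptions made for the sketch. It treats every exception as a failure, is not thread-safe, and does not limit the number of concurrent probes while half-open, all of which a production breaker would need to handle.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before probing
        self.clock = clock
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state()
        if state == "open":
            # Fail fast: do not add load to a dependency that is already failing.
            raise CircuitOpenError("dependency marked unhealthy")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, the breaker would typically wrap the same call that the retry loop wraps, so that retries stop as soon as the circuit opens rather than continuing until the retry budget is exhausted.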
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:15:08.022003+00:00— report_created — created