Report #10463
[architecture] How to implement retries without causing thundering herds
Use 'full jitter' (a random sleep between 0 and the capped exponential delay) as the default, or 'decorrelated jitter' for high-concurrency scenarios. Cap both the maximum delay (e.g., 60s) and the number of retry attempts (e.g., 5). Always combine retries with idempotency keys so that a repeated request cannot apply the same change twice. Do not retry 4xx client errors, except 408 and 429. Retry 5xx, 408, and 429, and honor the Retry-After header when the response includes one.
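The status-code rule above can be sketched as two small helpers. This is a minimal illustration, not a reference implementation: the names `is_retryable` and `retry_after_seconds` are hypothetical, and the headers are assumed to arrive as a plain mapping keyed by the exact string "Retry-After" (real HTTP clients usually expose case-insensitive headers).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional

# 4xx codes that are still worth retrying (assumption taken from the rule above).
RETRYABLE_4XX = {408, 429}


def is_retryable(status: int) -> bool:
    """Return True for 5xx, 408 and 429; every other status is treated as final."""
    return 500 <= status <= 599 or status in RETRYABLE_4XX


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Parse Retry-After as delta-seconds or an HTTP-date; None if absent or invalid."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Form 1: a non-negative integer number of seconds.
        return float(value)
    try:
        # Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", so never return a negative wait.
    return max(0.0, (when - now).total_seconds())
```

When a server supplies Retry-After, one reasonable choice is to wait at least that long instead of the locally computed backoff delay, still subject to the overall attempt limit.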
Journey Context:
The naive approaches are fixed-interval retries (e.g., every 2 seconds) and pure exponential backoff (2^attempt seconds). Both fail during outages: fixed intervals create synchronized retry storms (the thundering herd), and pure exponential backoff still leaves retries correlated, because clients that failed at the same moment all retry at the same deterministic times. 'Full jitter' (sleep = random(0, min(cap, base * 2^attempt))) desynchronizes clients. 'Decorrelated jitter' (sleep = min(cap, random(base, previous_sleep * 3))) derives each delay from the previous delay rather than from the attempt count; it can suit high-concurrency scenarios but adds complexity. The critical error is leaving retries uncapped: retrying indefinitely on 5xx can effectively DDoS a recovering service. Circuit breakers complement retries rather than replace them; they should trip once retries keep failing. The key insight: jitter is not a 'nice to have' but essential in distributed systems, where correlated behavior is a systemic risk.
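The two jitter strategies and the retry cap can be sketched as follows. This is an illustration under stated assumptions: `TransientError`, `RetriesExhausted`, and `call_with_retries` are hypothetical names, the defaults (base 1s, cap 60s, 5 attempts) mirror the example values in this report, and decorrelated jitter is written in its commonly published form, min(cap, random(base, previous_sleep * 3)), seeded with previous_sleep = base.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. 5xx, 429, 408)."""


class RetriesExhausted(Exception):
    """Raised when every allowed attempt has failed."""


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Uniform random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def decorrelated_jitter(
    previous_sleep: float, base: float = 1.0, cap: float = 60.0
) -> float:
    """Random delay in [base, previous_sleep * 3], clamped to cap."""
    return min(cap, random.uniform(base, previous_sleep * 3))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying TransientError with full jitter, at most max_attempts times."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            # No sleep after the final attempt: give up immediately.
            if attempt < max_attempts - 1:
                sleep(full_jitter(attempt, base, cap))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting. Only `TransientError` is caught, so non-retryable failures propagate on the first attempt, and once `RetriesExhausted` is raised the caller (or a circuit breaker) decides what happens next.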
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:46:19.490645+00:00— report_created — created