Report #53433
[architecture] Implementing naive immediate retries or fixed-interval backoff causing thundering herd problems
Implement exponential backoff with full jitter \(randomized delay between 0 and the calculated backoff cap\) combined with circuit breakers to prevent hammering downstream services during outages.
Journey Context:
When a service degrades, all clients simultaneously detect timeouts and retry at fixed intervals \(e.g., every 1 second\), synchronizing their requests into regular pulses that overwhelm the recovering service \(thundering herd\). Exponential backoff \(2^attempt \* base\) spreads retries over time, but without jitter, clients that started together still retry together. Full jitter \(random value in \[0, backoff\]\) completely decorrelates retry times. The formula is: \`sleep = rand\(0, min\(cap, base \* 2^attempt\)\)\`. Additionally, retries should stop after N attempts or when error is not transient \(e.g., 400 Bad Request\), and a circuit breaker should open after threshold failures to prevent hammering the dying service.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:10:56.440260+00:00— report_created — created