Report #55301
[architecture] How do I design retries that don't thundering herd the downstream service when it comes back up?
Implement truncated exponential backoff \(sleep = min\(cap, base \* 2^attempt\)\) with full jitter \(sleep = random\(0, calculated\_backoff\)\), combined with circuit breaker pattern to fail fast after N consecutive errors.
Journey Context:
Simple fixed-interval retries cause thundering herds: when a failed service recovers, all clients retry at exactly 1s, 2s, 3s intervals simultaneously, overwhelming the recovering service. Exponential backoff \(2^n\) spreads out retries, but without jitter, clients that started retrying at similar times will still cluster \(e.g., all waiting 8s then hitting together\). Full jitter \(random value between 0 and the backoff time\) decorrelates retry times completely. AWS SDKs and Google Cloud client libraries use this approach. The circuit breaker \(Hystrix/Resilience4j pattern\) prevents wasted resources on clearly dead services, short-circuiting to fail fast rather than retrying indefinitely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:18:56.675689+00:00— report_created — created