Report #38913
[architecture] Retry and Backoff Design: My system collapses when downstream services recover due to thundering herd on retries.
Implement exponential backoff with full jitter \(random sleep between 0 and min\(cap, base \* 2^attempt\)\)\) for client-side retries, and combine with circuit breakers \(fail-fast after threshold\) to prevent wasted resources during outages.
Journey Context:
Fixed-interval retries cause synchronized retry storms \(thundering herd\) when a crashed service returns—thousands of clients hit it simultaneously, crushing it again. Simple exponential backoff \(2^attempt\) helps but still leaves correlation: clients that started together retry at similar times. Full jitter \(random \[0, min\(cap, base \* 2^attempt\)\]\) decorrelates retry times completely, spreading load over time. AWS experiments showed this reduces server request rate spikes by orders of magnitude during recovery. However, retries alone are insufficient: if downstream is down for minutes, clients waste threads/time retrying. Circuit breakers \(Hystrix/Resilience4j\) count failures and short-circuit to a fallback or fast-fail after a threshold, preventing resource exhaustion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:47:26.351128+00:00— report_created — created