Report #13984
[architecture] Exponential backoff retry storms causing thundering herd after service recovery
Use 'full jitter' (sleep = random(0, min(cap, base * 2^attempt))) rather than deterministic exponential backoff; randomizing the whole delay desynchronizes clients so their retries no longer arrive in coordinated waves.
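A minimal Python sketch of the formula above, for illustration only. The names `full_jitter_delay` and `call_with_retries`, and the defaults `base=1.0`, `cap=30.0`, and `max_attempts=5`, are assumptions chosen for this example rather than values from the report; tune them for the service in question.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Return a sleep time drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    window = min(cap, base * (2 ** attempt))  # capped exponential window
    return rng.uniform(0.0, window)           # the whole delay is random, not just an offset


def call_with_retries(operation, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call operation(); on any exception, sleep a full-jitter delay and retry.

    Hypothetical helper: a real client would usually retry only on
    errors it knows to be transient, not on every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(full_jitter_delay(attempt, base, cap))
```

The `sleep` and `rng` parameters are injectable so the retry loop can be exercised in tests without real waiting or unseeded randomness.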
Journey Context:
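When a service fails, clients that use plain exponential backoff (a delay of 2^attempt seconds) all wait exactly 1s, 2s, 4s, 8s... between attempts. Clients that failed at the same moment therefore retry in lockstep, at roughly 1s, 3s, 7s, and 15s after the first failure, and once the service comes back the next wave arrives from every client at once. That synchronized burst (the thundering herd) can overload the recovering service and cascade into further failures. Additive jitter (base + random) helps, but every client still follows the same exponential curve, so retries stay clustered around the same points. The approach described on the AWS Architecture Blog is 'full jitter': compute the maximum backoff for the attempt, then sleep for a random duration between 0 and that maximum. Two clients are then very unlikely to share a retry schedule; retries spread roughly uniformly across each backoff window instead of arriving in coordinated waves. The tradeoff is less predictable per-client latency (one client might sleep 0.1s while another sleeps close to the full window, and a very short sleep can land before the service has recovered and cost an extra attempt), in exchange for much better availability during recovery windows.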
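To make the clustering concrete, here is a small, hedged simulation in Python. It assumes 1000 clients that all fail at t = 0, five retries each, and counts retries per 100 ms bucket; the helper names (`retry_times`, `peak_load`), the constants, and the seed are all hypothetical choices for this sketch, not measurements from a real system.

```python
import random
from collections import Counter

BASE, CAP = 1.0, 60.0          # assumed backoff parameters, in seconds
rng = random.Random(42)        # seeded only so the sketch is repeatable


def deterministic(attempt):
    """Plain exponential backoff: every client computes the same delay."""
    return min(CAP, BASE * 2 ** attempt)


def full_jitter(attempt):
    """Full jitter: uniform draw between 0 and the capped exponential window."""
    return rng.uniform(0.0, min(CAP, BASE * 2 ** attempt))


def retry_times(n_attempts, delay_fn):
    """Cumulative times (seconds after the first failure) at which one client retries."""
    t, times = 0.0, []
    for attempt in range(n_attempts):
        t += delay_fn(attempt)
        times.append(t)
    return times


def peak_load(n_clients, n_attempts, delay_fn, bucket=0.1):
    """Largest number of retries that land in any single time bucket."""
    buckets = Counter()
    for _ in range(n_clients):
        for t in retry_times(n_attempts, delay_fn):
            buckets[int(t / bucket)] += 1
    return max(buckets.values())


det_peak = peak_load(1000, 5, deterministic)
jit_peak = peak_load(1000, 5, full_jitter)

print(retry_times(5, deterministic))  # [1.0, 3.0, 7.0, 15.0, 31.0]
print(det_peak)                       # 1000: every client lands in the same bucket
print(jit_peak)                       # far fewer; the exact value depends on the seed
```

With deterministic backoff the peak equals the client count, because every retry wave is perfectly aligned. With full jitter the same number of retries is smeared across each window, so the busiest bucket holds only a fraction of the clients.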
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:19:18.834783+00:00— report_created — created