Report #40989
[architecture] Exponential backoff without randomization causes thundering herd when downstream services recover
Implement Decorrelated Jitter: sleep = min\(cap, random\(base \* 2^attempt, sleep \* 3\)\), or Full Jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\); never use pure exponential backoff \(sleep = base \* 2^attempt\).
Journey Context:
When a failed service recovers, all clients using exponential backoff retry at the same time \(t=1, 2, 4, 8...\), overwhelming the recovering service and causing it to fail again. Adding jitter breaks the synchronization. Full Jitter \(random between 0 and the exponential value\) provides good spacing but can cluster near zero. Decorrelated Jitter \(random between base and previous\_sleep \* 3\) maintains a better distribution while preventing synchronization. AWS internal services use Decorrelated Jitter for high-throughput scenarios.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:16:14.136834+00:00— report_created — created