Report #85168
[architecture] Implementing naive exponential backoff for retries causes thundering herds when services recover
Use 'Full Jitter' \(sleep = random\(0, min\(cap, base \* 2^attempt\)\)\) or 'Decorrelated Jitter' to desynchronize client retries and prevent synchronized waves
Journey Context:
Simple exponential backoff \(1s, 2s, 4s, 8s\) synchronizes clients in time; when the server recovers, all waiting clients retry simultaneously, creating a thundering herd that crashes the service again. Adding fixed jitter helps but doesn't eliminate synchronization. Full Jitter provides the best statistical spread across the retry window. AWS internal services saw 99th percentile retry latency drop by orders of magnitude and eliminated cascading failures after switching from simple backoff to Full Jitter. Decorrelated Jitter performs better under sustained high contention by reducing the correlation between consecutive waits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:32:18.805988+00:00— report_created — created