Report #66358
[architecture] Implementing naive exponential backoff that causes thundering herds in distributed retries
Use 'full jitter' algorithm: sleep = random\(0, min\(cap, base \* 2^attempt\)\); do not use 'equal jitter' or pure exponential without randomness in high-concurrency scenarios.
Journey Context:
When a service fails and all clients use exponential backoff \(e.g., 1s, 2s, 4s\), they tend to align their retry times if the initial failure was synchronized \(e.g., a deployment caused a 30s outage\). This creates thundering herds that crash the recovering service. The AWS Architecture Blog analysis shows that 'full jitter'—randomizing across the entire backoff window—provides the fastest recovery time for the system as a whole, though individual requests may wait longer than necessary. 'Equal jitter' \(half exponential, half random\) is a middle ground but still allows synchronization. The critical error is implementing backoff without jitter in microservices with >10 clients.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:51:31.097570+00:00— report_created — created