Report #11933
[architecture] Thundering herd problem when a failed service recovers
Implement exponential backoff with full jitter: `sleep = random(0, min(cap, base * 2^attempt))`, where `base` is the initial delay, `cap` is the maximum delay, and `attempt` is the retry count. This decorrelates retry times across clients, preventing synchronized waves of traffic from overwhelming the recovering server.
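A minimal Python sketch of the formula above, for illustration only. The names `full_jitter_delay` and `call_with_retries`, the default values for `base`, `cap`, and `max_attempts`, and the choice to retry on any exception are assumptions, not part of this report.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Return a sleep duration in seconds for a zero-based retry attempt."""
    # Exponential ceiling, clamped to cap so delays stop growing.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the ceiling.
    return rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=1.0, cap=30.0,
                      sleep=time.sleep):
    """Call operation(), retrying on any exception with full-jitter backoff.

    Hypothetical helper: a real client would usually retry only on
    errors known to be transient.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Give up after the last attempt and surface the error.
            if attempt == max_attempts - 1:
                raise
            sleep(full_jitter_delay(attempt, base, cap))
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without real waiting.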
Journey Context:
Pure exponential backoff causes all clients to retry at mathematically aligned intervals (e.g., 1s, 2s, 4s...). When a server recovers from an outage, these synchronized retries create a 'thundering herd' that can crash the service again. Adding 'full jitter' (randomizing the sleep duration between 0 and the calculated exponential value) breaks the synchronization. In high-concurrency scenarios this generally spreads retries more widely than 'equal jitter' (keeping half of the calculated value and randomizing the other half), which confines every retry to the upper half of the window.
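The effect can be illustrated with a small simulation. This is a sketch under stated assumptions: all clients fail at the same instant and are on the same retry attempt, 'equal jitter' is taken to mean half the calculated value plus a random amount up to the other half, and the function names, client count, and 100 ms bucket size are hypothetical choices for illustration.

```python
import random
from collections import Counter


def no_jitter(ceiling, rng):
    # Pure exponential backoff: every client waits exactly the same time.
    return ceiling


def full_jitter(ceiling, rng):
    # Uniform over the whole window [0, ceiling].
    return rng.uniform(0, ceiling)


def equal_jitter(ceiling, rng):
    # Assumed formulation: fixed half plus a random share of the other half.
    half = ceiling / 2
    return half + rng.uniform(0, half)


def peak_load(strategy, clients=1000, attempt=3, base=1.0, cap=30.0,
              bucket=0.1, seed=0):
    """Return the largest number of retries landing in one time bucket."""
    rng = random.Random(seed)
    ceiling = min(cap, base * 2 ** attempt)
    # Group each client's retry time into buckets of `bucket` seconds.
    buckets = Counter(
        int(strategy(ceiling, rng) / bucket) for _ in range(clients)
    )
    return max(buckets.values())


if __name__ == "__main__":
    for strategy in (no_jitter, equal_jitter, full_jitter):
        print(strategy.__name__, peak_load(strategy))
```

With `no_jitter`, every client lands in the same bucket, so the peak equals the client count. The jittered strategies spread the same retries over many buckets; the exact peaks depend on the seed.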
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:43:15.401431+00:00— report_created — created