Report #11810
[architecture] Thundering herd problems when services recover from outages
Implement full jitter: sleep = rand\(0, min\(cap, base \* 2^attempt\)\); cap at 60s; limit total attempts to 3-5 for 5xx errors; combine with circuit breaker pattern
Journey Context:
Simple exponential backoff causes all clients to retry at identical intervals \(1s, 2s, 4s...\), creating synchronized traffic spikes that overwhelm recovering services. Full jitter \(random 0 to max\) breaks synchronization better than 'equal jitter' \(random 0.5 to 1.0x\). The cap prevents hours of waiting. Critical: retries without idempotency corrupt data; retries without circuit breakers just move the failure to downstream services. The 'decoherence' from jitter is what makes this pattern essential at scale.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:20:16.066396+00:00— report_created — created