Report #6865
[architecture] Thundering herd attacks during downstream service recovery
Implement decorrelated jitter (sleep = rand(min_delay, previous_delay * 3)) rather than simple exponential backoff; cap the maximum delay at 60s and couple retries with a circuit breaker that opens after 5 consecutive failures.
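A minimal sketch of the delay calculation described above, not a definitive implementation. The function name decorrelated_jitter_delays and the 0.1s minimum delay are illustrative assumptions; only the x3 multiplier and the 60s cap come from this report.

```python
import random
from typing import Iterator, Optional


def decorrelated_jitter_delays(
    min_delay: float = 0.1,   # assumed lower bound in seconds (not from the report)
    cap: float = 60.0,        # maximum delay, per the report
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield an endless sequence of retry delays using decorrelated jitter."""
    if rng is None:
        rng = random.Random()
    delay = min_delay
    while True:
        # Draw uniformly from [min_delay, previous_delay * 3], then clamp to the cap.
        delay = min(cap, rng.uniform(min_delay, delay * 3))
        yield delay
```

A caller would pull one value per failed attempt, for example delay = next(delays), and sleep for that long before retrying. Because each draw depends on the previous delay rather than on the attempt number, two clients that failed at the same moment quickly drift apart.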
Journey Context:
Simple exponential backoff without jitter causes a 'thundering herd': when a downstream service recovers, every client that failed at the same moment retries at exactly the same intervals and can crash it again. Naive fixed intervals are no better, because synchronized clients stay synchronized and each retry wave arrives as a load spike. The solution is jitter: randomizing the delay. Full jitter (a random delay between 0 and the exponential cap) provides the best spread but slower recovery; decorrelated jitter (sleep = random(min_delay, previous_delay * 3)) offers a better balance of dispersion and convergence time. Crucially, retries must be coupled with circuit breakers, so that a recovering dependency is probed gently and the breaker does not flap in and out of the half-open state. Without a circuit breaker, a persistent failure becomes a retry loop that ties up threads and connections, exhausts the client's connection pool, and propagates the failure upstream.
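The coupling of retries and a circuit breaker can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: the names CircuitBreaker, CircuitOpenError and call_with_retry, the 30s reset timeout, the 0.1s minimum delay and the limit of 8 attempts are all hypothetical; only the threshold of 5 consecutive failures, the x3 multiplier and the 60s cap come from this report.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Simplified breaker; not thread-safe, and half-open is modelled as one timed probe."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # 5 consecutive failures, per the report
        self.reset_timeout = reset_timeout          # assumed cool-down before a probe
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls pass through
        # Open: permit a probe (half-open) only after the cool-down has elapsed.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe


def call_with_retry(op: Callable[[], T], breaker: CircuitBreaker, max_attempts: int = 8,
                    min_delay: float = 0.1, cap: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    """Call op, retrying with decorrelated jitter until it succeeds or the breaker opens."""
    if rng is None:
        rng = random.Random()
    delay = min_delay
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = op()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, rng.uniform(min_delay, delay * 3))
            sleep(delay)
        else:
            breaker.record_success()
            return result
    raise ValueError("max_attempts must be at least 1")
```

With these defaults, a dependency that keeps failing is attempted five times, after which the breaker opens and further calls raise CircuitOpenError immediately instead of holding a thread and a pooled connection. Once the reset timeout has elapsed, the next call acts as a probe: success closes the breaker, and a failure re-opens it for another cool-down.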
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:14:05.217433+00:00— report_created — created