Report #21075
[architecture] How to implement retries without causing thundering herd after service outages
Use exponential backoff with full jitter: sleep = random(0, min(cap, base * 2^attempt)); this desynchronizes client retry times to prevent synchronized traffic spikes.
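A minimal Python sketch of the formula above, wrapped in a retry loop. The names `full_jitter_delay` and `call_with_retries`, the defaults (`base=0.5`, `cap=60.0`, `max_attempts=5`), and the choice to retry on any exception are assumptions for illustration and do not come from the report; a real client would normally retry only on errors it knows to be transient.

```python
import random
import time


def full_jitter_delay(attempt, base=0.5, cap=60.0):
    """Return a delay in seconds drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    # Clamp the exponent so very large attempt counts cannot overflow a float;
    # base * 2**32 already exceeds any sensible cap.
    exponent = min(attempt, 32)
    ceiling = min(cap, base * (2 ** exponent))
    return random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.5, cap=60.0, sleep=time.sleep):
    """Call operation(); on an exception, wait a full-jitter delay and try again.

    The sleep parameter is injectable so the loop can be exercised without
    actually waiting (hypothetical convenience, not part of the technique).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(full_jitter_delay(attempt, base, cap))
```

Usage might look like `call_with_retries(lambda: client.fetch(url))`, where `client.fetch` stands in for whatever call is being retried. Bounding the number of attempts is a separate decision from capping the delay; both are shown here only as one plausible arrangement.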
Journey Context:
Naive retry logic (immediate retries or fixed delays) fails during outages because all clients retry simultaneously when the service recovers, creating a 'thundering herd' that crashes the service again. Pure exponential backoff (sleep = base * 2^attempt) also fails: clients that started failing at the same moment compute identical delays, so after a 60s outage every client that has backed off to 64s retries at exactly the same instant, and again at each later doubling, creating predictable spikes. Adding 'full jitter' (a random value between 0 and the exponential value) spreads retries uniformly across the time window, smoothing load. An alternative, 'decorrelated jitter' (sleep = min(cap, random(base, previous * 3))), derives each delay from the previous delay rather than from the attempt count; it may reduce tail latency compared to full jitter but adds complexity, since each client must track its previous delay. The tradeoff with full jitter is that some clients will retry sooner than ideal (even immediately), but system stability is prioritized over individual request latency. The implementation must include a maximum cap (e.g., 60s) so that delays do not grow without bound during prolonged outages.
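The difference between the three strategies can be sketched in a few lines of Python. This is an illustration under assumed values (`BASE = 1.0`, `CAP = 60.0`, 1000 simulated clients, all on attempt 5), not a measurement; the function names are hypothetical. It counts how many distinct retry instants the clients produce with and without jitter, then prints a short decorrelated-jitter sequence for one client.

```python
import random

BASE = 1.0   # assumed initial delay in seconds
CAP = 60.0   # assumed maximum delay in seconds


def pure_exponential(attempt):
    """No jitter: every client computes the same delay for a given attempt."""
    return min(CAP, BASE * 2 ** attempt)


def full_jitter(attempt):
    """Uniform draw between 0 and the capped exponential value."""
    return random.uniform(0, pure_exponential(attempt))


def decorrelated_jitter(previous):
    """Uniform draw between BASE and 3x the previous delay, then capped."""
    return min(CAP, random.uniform(BASE, previous * 3))


if __name__ == "__main__":
    clients = 1000
    attempt = 5

    # Without jitter all clients land on one instant; with full jitter they
    # spread across the window (bucketed to 0.1s to make the count readable).
    fixed = {pure_exponential(attempt) for _ in range(clients)}
    spread = {round(full_jitter(attempt), 1) for _ in range(clients)}
    print("distinct retry instants without jitter:", len(fixed))
    print("distinct retry instants with full jitter:", len(spread))

    # Decorrelated jitter carries state: each delay seeds the next one.
    delay = BASE
    for step in range(6):
        delay = decorrelated_jitter(delay)
        print(f"decorrelated step {step}: {delay:.2f}s")
```

Note that `decorrelated_jitter` never returns less than `BASE`, whereas `full_jitter` can return values arbitrarily close to zero, which is the "retry sooner than ideal" tradeoff described above.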
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T13:46:43.157481+00:00— report_created — created