Report #40820
[architecture] Thundering herd problem during service recovery after outage
Implement full-jitter exponential backoff: sleep = random(0, min(cap, base * 2^attempt)). Use base=100ms, cap=60s, and allow at most 3-5 retry attempts before tripping a circuit breaker. For high-contention client SDKs, use decorrelated jitter: sleep = min(cap, random(base, prev_sleep * 3)), where prev_sleep is the previous delay and starts at base.
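A minimal Python sketch of the two formulas above, using the base and cap values from this report. The function names, the `rng` and `sleep_fn` parameters, and the `call_with_retries` helper are hypothetical and chosen for illustration only. The circuit breaker itself is not shown: the helper simply re-raises once the attempt budget is spent.

```python
import random
import time

BASE = 0.1   # seconds (100 ms, from the report)
CAP = 60.0   # seconds (from the report)


def full_jitter(attempt, base=BASE, cap=CAP, rng=random):
    """sleep = random(0, min(cap, base * 2^attempt))"""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def decorrelated_jitter(prev_sleep, base=BASE, cap=CAP, rng=random):
    """sleep = min(cap, random(base, prev_sleep * 3)); start with prev_sleep = base."""
    return min(cap, rng.uniform(base, prev_sleep * 3))


def call_with_retries(operation, max_attempts=4, sleep_fn=time.sleep):
    """Run operation(), sleeping a full-jitter delay between failed attempts.

    Illustrative only: catches every Exception and re-raises the last one
    when max_attempts is exhausted, which is where a circuit breaker
    would normally take over.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep_fn(full_jitter(attempt))
```

For decorrelated jitter, the caller keeps the last delay and feeds it back in: `delay = decorrelated_jitter(delay)` with `delay` initially set to `BASE`.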
Journey Context:
Plain exponential backoff without jitter synchronizes retries. Clients that failed at the same moment all retry on the same schedule (for example 1s, 2s, 4s), producing traffic spikes that can crash the recovering service again. Full jitter decorrelates clients by drawing each wait uniformly at random between 0 and the exponential bound. AWS SDKs use this pattern. The 'decorrelated jitter' variant is intended to give better throughput in high-contention scenarios: each delay is drawn from between base and three times the previous delay, so sleeps never fall below base.
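The synchronization effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not a measurement: it assumes 1000 clients that all failed at the same instant, a 1s base to match the 1s, 2s, 4s schedule above, and a fixed random seed. The helper names and the 100 ms bucket width are hypothetical.

```python
import random
from collections import Counter


def plain_backoff(attempt, base=1.0, cap=60.0):
    """Exponential backoff with no jitter: every client gets the same delay."""
    return min(cap, base * 2 ** attempt)


def full_jitter(attempt, rng, base=1.0, cap=60.0):
    """Full jitter: uniform delay between 0 and the exponential bound."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def peak_retries_per_bucket(delays, bucket=0.1):
    """Largest number of retries that land in one bucket-wide time window."""
    buckets = Counter(int(d / bucket) for d in delays)
    return max(buckets.values())


rng = random.Random(42)   # fixed seed so the run is repeatable
clients = 1000            # assumed number of clients waiting on the service
attempt = 2               # third retry: plain backoff waits exactly 4 s

plain = [plain_backoff(attempt) for _ in range(clients)]
jittered = [full_jitter(attempt, rng) for _ in range(clients)]

plain_peak = peak_retries_per_bucket(plain)
jitter_peak = peak_retries_per_bucket(jittered)

print("peak retries in one 100 ms window, plain backoff:", plain_peak)
print("peak retries in one 100 ms window, full jitter:  ", jitter_peak)
```

Without jitter, all 1000 retries fall into the same window. With full jitter they spread across the whole 0-4s range, so the busiest window sees only a small fraction of the clients.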
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:59:11.763296+00:00— report_created — created