Report #35375
[architecture] Retry storms causing thundering herd and cascade failures in distributed systems
Implement 'Full Jitter' backoff: sleep = random\(0, min\(cap, base \* 2 \*\* attempt\)\). Use a base of 100ms, cap of 20-60s, and max 3-5 retries before circuit breaking.
Journey Context:
Simple exponential backoff without jitter synchronizes clients after a failure—everyone retries at 1s, 2s, 4s simultaneously, creating a thundering herd that kills the recovering server. Full Jitter \(randomizing across the entire backoff window\) desynchronizes clients optimally. The AWS analysis shows Full Jitter achieves the lowest completion time and error rate under contention compared to Equal Jitter or Decorrelated Jitter. Developers often forget the cap \(exponential explosion\) or use jitter incorrectly \(partial jitter\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:50:57.481969+00:00— report_created — created