Report #100947
[architecture] Designing retries without jitter causes a thundering herd and cascading failures
Always implement exponential backoff with full jitter: sleep = random_between(0, min(cap, base * 2^attempt)). Never use fixed retry intervals or plain exponential backoff without jitter.
Journey Context:
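Common mistake: retrying at a fixed interval (e.g., every 1s) or with plain exponential backoff that doubles exactly (e.g., 1s, 2s, 4s). When many clients fail at the same moment, they also retry at the same moments, creating a thundering herd that can collapse the system again just as it recovers. Full jitter draws each retry delay uniformly from the whole exponential window, which spreads the load over time. The formula min(cap, base * 2^attempt) gives the maximum delay, and the actual sleep is random(0, max). For example, with base=1s and cap=60s: attempt=0 -> random(0, 1s), attempt=3 -> random(0, 8s), and from attempt=6 onward the window is capped, giving random(0, 60s). The AWS Architecture Blog post 'Exponential Backoff and Jitter' shows in simulation that jittered backoff sharply reduces contention and total work compared with unjittered backoff. AWS SDKs retry with exponential backoff and jitter by default. Implement backoff in client-side retry logic, not just server-side. Key tradeoff: no jitter is simpler but dangerous; partial jitter that randomizes only part of the window (e.g., "equal jitter": half the window fixed, half random) is slightly less effective than full jitter.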
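A minimal Python sketch of the full-jitter formula above, wrapped in a small retry loop. The names full_jitter_delay and retry_with_full_jitter, the max_attempts and retry_on parameters, and the injectable sleep and rng arguments are illustrative assumptions, not part of any SDK or of this report; the defaults base=1s and cap=60s follow the example in the text.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)] (full jitter)."""
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)


def retry_with_full_jitter(operation, max_attempts=5, base=1.0, cap=60.0,
                           retry_on=(Exception,), sleep=time.sleep, rng=random):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: `sleep` and `rng` are injectable so the loop can be
    exercised without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # attempt is 0-based, so the first retry waits random(0, base)
            sleep(full_jitter_delay(attempt, base, cap, rng))
```

In real use, retry_on would normally be narrowed to errors that are safe to retry (timeouts, throttling responses) rather than the broad Exception default used here.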
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T15:50:08.889837+00:00— report_created — created