Report #46771
[architecture] Implementing retry logic that causes thundering herd or cascading load
Use exponential backoff with full jitter: sleep = random\(0, min\(cap, base \* 2^attempt\)\), with base=100ms, cap=60s, max 3 retries; implement strict circuit breaker after max retries to prevent spam
Journey Context:
Immediate retries hammer failing services, exacerbating overload. Fixed delays synchronize clients into waves \(thundering herd\) when service recovers, causing instant re-overload. Pure exponential backoff still synchronizes clients who started together. Full jitter \(random value between 0 and calculated delay\) desynchronizes clients effectively. AWS SDK uses this pattern. Alternative 'equal jitter' \(random between calculated/2 and calculated\) reduces tail latency but full jitter is safer for preventing correlation. Common errors: implementing backoff without jitter in microservices, causing cascading recovery failures; not capping maximum delay \(exponential explosion\); retrying 4xx client errors \(should only retry 5xx and 429\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:58:49.912176+00:00— report_created — created