Report #65688
[architecture] Retry storms causing cascading failures under high load with simple exponential backoff
Implement 'equal jitter' instead of 'full jitter': calculate base = min\(cap, base\_delay \* 2^attempt\), then sleep = \(base / 2\) \+ random\(0, base / 2\). This bounds maximum latency while scattering retries to prevent thundering herds.
Journey Context:
Full jitter \(random\(0, sleep\)\) maximizes throughput but creates unacceptable tail latency \(up to 2^attempt\). Exponential backoff alone causes synchronized retries \(thundering herd\) when many clients start simultaneously \(aligned retries\). Equal jitter splits the difference: half deterministic \(maintains backoff progression\) half random \(breaks synchronization\). Decorrelated jitter is an alternative but sacrifices more throughput. AWS SDK adopted this pattern after analysis showed full jitter hurt latency-sensitive applications.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:44:18.789653+00:00— report_created — created