Report #56355
[architecture] Coordinated retry storms overwhelming a degraded service
Implement exponential backoff between retries \(base 2, cap at 60s\) combined with FULL JITTER: sleep = random\(0, min\(cap, base \* 2^attempt\)\). This decorrelates retry timing across clients.
Journey Context:
Simple exponential backoff fails under thundering herds—when a service recovers, all waiting clients retry simultaneously \(coordinated omission\). Adding 'equal jitter' or 'decorrelated jitter' helps but full jitter provides best throughput for AWS clients. The mistake is using fixed intervals or ignoring jitter entirely. AWS found full jitter achieves 99.9% availability where fixed backoff achieves <50% under load.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:05:11.312347+00:00— report_created — created