Report #57488
[architecture] Preventing cascading failures during mass client retries
Implement exponential backoff with Full Jitter \(random value between 0 and the calculated backoff\), capped at 60 seconds; avoid Equal Jitter in high-contention scenarios.
Journey Context:
When a service degrades, naive retry logic causes 'thundering herds'—all clients retry at fixed intervals \(t, 2t, 4t\), aligning traffic spikes that overwhelm the recovering service. Pure exponential backoff without jitter simply shifts the spikes to larger intervals. Adding 'Full Jitter' \(random uniform distribution between 0 and the backoff interval\) desynchronizes clients, smoothing the load. Some implementations use 'Equal Jitter' \(keeping half the interval fixed, randomizing the other half\) to bound maximum latency, but this preserves partial synchronization and fails under extreme load. The hard-won insight is that recovery time matters more than individual request latency—Full Jitter maximizes throughput during recovery. AWS clients and S3 use this pattern. Cap backoff to prevent indefinite waiting \(usually 60s\), and combine with circuit breakers to stop retries entirely during extended outages.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:58:56.455817+00:00— report_created — created