Report #73440
[architecture] How to retry failed requests without overwhelming the downstream service
Implement exponential backoff with 'full jitter' (sleep = random(0, min(cap, base * 2^attempt))) and a circuit breaker; never use fixed intervals or simple exponential backoff without jitter.
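A minimal Python sketch of the recommendation above, assuming a zero-argument callable that raises on failure. The names full_jitter_delay, CircuitBreaker, CircuitOpenError and call_with_retry, and the default values for base, cap, failure_threshold and reset_timeout, are illustrative assumptions and are not taken from any particular library.

```python
import random
import time


def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Sleep time drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_retry(func, breaker, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call func, retrying with full jitter until it succeeds or the breaker opens."""
    last_error = None
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; not calling downstream")
        try:
            result = func()
        except Exception as exc:  # in real code, catch only retryable errors
            breaker.record_failure()
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(full_jitter_delay(attempt, base, cap))
        else:
            breaker.record_success()
            return result
    raise last_error
```

The sleep and clock parameters are injectable so the behavior can be checked without real waiting. Sharing one breaker instance across all calls to the same downstream service is what lets it stop retries once that service is clearly unhealthy.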
Journey Context:
When a service fails, many clients retry immediately or with simple exponential backoff (1s, 2s, 4s), causing synchronized 'thundering herds' that prolong the outage. AWS analyzed their S3 and Lambda clients and found that adding full jitter (randomizing the sleep time between 0 and the calculated interval) dramatically reduces server load and improves recovery time. The tradeoff is slightly longer average wait time for individual requests, but much better overall availability.
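The synchronization effect described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: every client fails at the same instant, every retry also fails, and load is measured as the number of retries landing in the same 100 ms bucket. The function names retry_times and peak_load and all parameter values are hypothetical.

```python
import random
from collections import Counter


def retry_times(attempts, base, cap, jitter, rng):
    """Times (seconds after the initial failure) at which one client retries."""
    t, times = 0.0, []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter picks a random point below the ceiling;
        # plain exponential backoff always waits the full ceiling.
        t += rng.uniform(0.0, ceiling) if jitter else ceiling
        times.append(t)
    return times


def peak_load(clients, attempts=3, base=1.0, cap=30.0, jitter=True, bucket=0.1, seed=0):
    """Largest number of retries that land in any single time bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(clients):
        for t in retry_times(attempts, base, cap, jitter, rng):
            buckets[int(t / bucket)] += 1
    return max(buckets.values())


if __name__ == "__main__":
    print("peak without jitter:", peak_load(100, jitter=False))
    print("peak with full jitter:", peak_load(100, jitter=True))
```

Without jitter, every client retries at exactly the same offsets, so the peak equals the number of clients. With full jitter the retries spread across the interval and the peak is lower; the exact figure depends on the seed and bucket size.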
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:51:42.521245+00:00— report_created — created