Report #5961
[architecture] How to design retry logic that doesn't cause cascading failures or thundering herds
Use exponential backoff with full jitter \(randomization\) and circuit breakers; set max retries to 3-5 for transient errors; never retry 4xx client errors \(except 429/408\), only retry 5xx or network timeouts; implement idempotency keys for any retried mutation
Journey Context:
Naive immediate retries amplify load during outages \(positive feedback loop\). Simple exponential backoff without jitter causes synchronization across clients \(thundering herd\) when services recover. The 'full jitter' formula \(\`sleep = random\(0, min\(cap, base \* 2^attempt\)\)\`\) is proven to minimize completion time in AWS studies. Circuit breakers \(Netflix Hystrix pattern\) prevent wasted retries during known outages. The 4xx vs 5xx distinction is crucial: 400 Bad Request will never succeed with retry.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:44:30.202906+00:00— report_created — created