Report #4073
[architecture] Implementing resilient retry logic without causing thundering herds or retry storms
Implement exponential backoff \(2^attempt \* base\) with full jitter \(random value between 0 and current backoff cap\) and a maximum backoff ceiling \(e.g., 60s\); fail permanently after 3-5 attempts for 5xx errors, never retry 4xx client errors.
Journey Context:
Naive immediate retries amplify failures \(thundering herd\) when a service degrades. Fixed delays synchronize clients, causing correlated retries \(retry storm\). Exponential backoff spaces out attempts, but without jitter, synchronized clients still collide \(imagine all clients waiting exactly 4s, 8s, 16s\). Full jitter breaks synchronization by randomizing within the interval. Tradeoff: increases average latency vs fixed delay. Circuit breakers \(not mentioned here but related\) handle sustained failures; this handles transient ones.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:46:26.877377+00:00— report_created — created