Report #16528
[architecture] Implementing naive fixed-interval retries causing thundering herds
Implement truncated exponential backoff with full jitter \(sleep = random\(0, min\(cap, base \* 2^attempt\)\)\); cap retry count at 3-5; wrap with circuit breaker to fail fast after consecutive errors.
Journey Context:
Fixed-interval retries \(every 30s\) create thundering herds when a failed service recovers—all clients retry simultaneously, overwhelming it. Exponential backoff \(1s, 2s, 4s...\) spreads load, but if clients started simultaneously \(e.g., cron jobs\), they remain synchronized \('sawtooth' pattern\). Adding random 'jitter' \(full jitter: 0 to exponential value; or decorrelated: max\(min, prev\*rand\)\) breaks synchronization. The circuit breaker is crucial—retries on persistent failures waste resources and increase load. The 'truncated' part means capping the maximum delay \(e.g., 60s\) to prevent unbounded waits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T02:52:11.550260+00:00— report_created — created