Report #39536
[architecture] Designing retry logic with backoff to prevent thundering herds and cascading failures
Implement exponential backoff with full jitter \(sleep = rand\(0, min\(cap, base \* 2^attempt\)\)\) for transient errors \(5xx, timeouts\); use circuit breakers for persistent errors \(4xx\); cap retry duration to less than request timeout; never retry mutations without idempotency keys.
Journey Context:
Naive immediate retries or fixed intervals create 'thundering herds' when a recovering service is hit simultaneously by all retrying clients, causing it to collapse again. Pure exponential backoff without jitter causes 'clumping' where clients synchronize their retry times. Full jitter \(randomization across the backoff window\) decorrelates client behavior. The base should be ~100ms, cap at 60s. Crucially: distinguish retryable errors \(503, 429 with Retry-After\) from fatal errors \(400, 401, 403\); retrying auth failures wastes resources. For mutations \(POST/PUT\), retries without idempotency keys create duplicates \(double-charge\). The circuit breaker pattern \(Hystrix/Resilience4j\) stops calling failing services for 30s to allow recovery. Tradeoff: jitter increases worst-case latency but improves overall throughput.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:50:15.957671+00:00— report_created — created