Report #42283
[architecture] How should I design retry logic with backoff to avoid overwhelming failing APIs?
Implement exponential backoff with full jitter (a random delay between 0 and 2^attempt * base), with the delay capped at 60-120s. Combine it with a circuit breaker that opens after 5 consecutive failures and half-opens after a cooldown. Never retry 4xx client errors (except 429 with a Retry-After header). Do retry 5xx responses and network timeouts, and attach idempotency keys to state-changing requests that may be retried.
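A minimal sketch of the full-jitter delay described above, in Python. The function name `full_jitter_delay`, the 0.5s base, and the 60s cap are illustrative assumptions, not values taken from any particular library; tune them for your service.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                      rng=random) -> float:
    """Return a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)].

    `attempt` is 0 for the first retry. `base` and `cap` are assumed
    defaults for illustration only.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so the intermediate value stays small even if a
    # caller passes a very large attempt count.
    upper = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick anywhere in [0, upper], not upper plus a small offset.
    return rng.uniform(0.0, upper)
```

A caller would sleep for `full_jitter_delay(attempt)` seconds before retry number `attempt`, and give up after a fixed number of attempts or once an overall deadline has passed.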
Journey Context:
Naive fixed-interval retries ("retry every 3 seconds") create thundering herds when a service is recovering, immediately overwhelming it again. Exponential backoff spaces out attempts, but without jitter, synchronized clients (e.g., all instances rebooted at once) will still hit the server simultaneously at each backoff interval (a "sawtooth" pattern). Full jitter desynchronizes clients. The circuit breaker is critical: if the downstream service is down, retries waste resources and burn latency budget; the breaker skips calls for a cooldown period, so callers fail fast. Two common mistakes are retrying 400 Bad Request (which will never succeed) and ignoring the Retry-After header on 429 Too Many Requests. Idempotency keys are essential when retrying state-changing operations, so that a duplicated request does not apply its side effect twice.
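The breaker and the retry rules above can be sketched as follows. This is a simplified, single-threaded illustration: the class `CircuitBreaker`, the function `should_retry`, and the 30s cooldown are hypothetical names and values chosen here, and a production breaker would also need locking and a limit on concurrent trial calls in the half-open state.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; after
    `cooldown` seconds it lets a trial call through (half-open)."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown  # assumed value, for illustration only
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Open: reject until the cooldown has elapsed, then allow a trial.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure streak.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Opens the breaker, or re-opens it after a failed trial call.
            self._opened_at = self._clock()


def should_retry(status: Optional[int], has_retry_after: bool = False) -> bool:
    """Decide whether a failed call is worth retrying.

    `status` is the HTTP status code, or None when no response arrived
    (network error or timeout).
    """
    if status is None:
        return True
    if status == 429:
        # Only retry rate-limit responses that say when to come back.
        return has_retry_after
    # Other 4xx errors will not succeed on retry; 5xx errors may.
    return 500 <= status <= 599
```

In a request loop, check `allow_request()` before each call, report the outcome with `record_success()` or `record_failure()`, and consult `should_retry()` before sleeping for the next backoff delay. When a 429 carries Retry-After, wait at least that long rather than the computed backoff.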
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:26:33.058484+00:00— report_created — created