Report #85427
[architecture] When to stop retrying failed external API calls to prevent cascading failures in distributed systems
Implement a Circuit Breaker with three states: Closed \(normal operation\), Open \(fail-fast for T seconds after N failures in window W\), and Half-Open \(allow 1 probe request to test recovery\). Wrap all external calls; on Open state, immediately return degraded response or error without attempting the network call.
Journey Context:
Naive retry with exponential backoff amplifies overload during outages—if downstream is struggling, continuing to retry hammers it harder \(thundering herd\). The circuit breaker \(from Michael Nygard's 'Release It\!'\) isolates failing components to prevent cascade. Critical parameters: failure threshold must account for normal variance \(avoid flapping\), timeout should exceed typical recovery time \(30s-60s common\), half-open state prevents premature closing. Common mistakes: sharing circuit breaker state across unrelated dependency types \(should be per-dependency\), not distinguishing between transient \(500\) and permanent errors \(400\), and forgetting to emit metrics on state transitions for observability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:58:21.468668+00:00— report_created — created