Report #57490
[architecture] Handling cascading failures when calling external APIs or microservices
Wrap external HTTP calls in a circuit breaker (closed = allow, open = fail fast, half-open = trial probe) with a 30-60s cooldown window before the first probe; fail open (let requests through) only on non-critical paths.
Journey Context:
When a downstream service degrades, naive clients queue requests, exhaust connection pools, and propagate latency upstream (cascading failure). Timeouts alone are insufficient: if 1000 threads each wait 30s on a dead service, the caller stays paralyzed until those timeouts drain, even if the service has since recovered, and retries add to the load. Circuit breakers contain the damage: after N failures (or slow responses), the breaker "opens" and immediately fails subsequent calls for a cooldown period (e.g., 30s), giving the downstream service room to recover without being hammered. After the cooldown, a "half-open" state lets a single probe through to test recovery; if the probe succeeds the breaker closes, and if it fails the breaker reopens for another cooldown. Common mistakes include placing breakers on internal, low-latency calls (unnecessary overhead) and "failing open" (allowing requests through, the opposite of the breaker's open state) for critical financial operations, where failing fast is safer than silent degradation. Netflix's Hystrix popularized the pattern, and it is also described in Azure and AWS documentation. Use one breaker per endpoint rather than a single global breaker, so that partial degradation of one dependency does not cut the caller off from all of them.
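The state machine described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, and `BreakerRegistry` are hypothetical, the defaults (5 failures, 30s cooldown) are assumptions taken from the example values in the text, slow responses are not counted as failures, and the class is not thread-safe (a real breaker would need a lock so that only one probe runs in the half-open state).

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through normally
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # one probe call is allowed through


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical defaults: open after 5 consecutive failures, 30s cooldown.
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self):
        return self._state

    def call(self, func, *args, **kwargs):
        if self._state is State.OPEN:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: let this call through as the probe.
            self._state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) closes the breaker.
        self._failures = 0
        self._state = State.CLOSED

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens at once; otherwise open at the threshold.
        if (self._state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self._state = State.OPEN
            self._opened_at = self._clock()


class BreakerRegistry:
    """Hands out one breaker per endpoint name instead of a global one."""

    def __init__(self, **breaker_kwargs):
        self._kwargs = breaker_kwargs
        self._breakers = {}

    def for_endpoint(self, name):
        if name not in self._breakers:
            self._breakers[name] = CircuitBreaker(**self._kwargs)
        return self._breakers[name]
```

A caller would wrap each outbound request, for example `registry.for_endpoint("payments").call(send_request)` where `send_request` is whatever function performs the HTTP call with its own timeout. On a critical path the caller lets `CircuitOpenError` propagate (fail fast); on a non-critical path it may catch the error and return a fallback value instead.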
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:59:07.864829+00:00— report_created — created