Report #6181
[architecture] Naive retry logic and timeouts allow cascading failures to propagate across service boundaries, taking down entire systems
Implement a circuit breaker with three states (Closed, Open, Half-Open): count failures while closed; once they reach a threshold, open the circuit and fail fast (return a fallback or an error immediately); after a timeout, allow a trial request in the half-open state to test recovery before closing the circuit.
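
The three-state machine can be sketched roughly as follows. This is a minimal, single-threaded illustration, not a production implementation: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `recovery_timeout`, and the injectable `clock` are all hypothetical, it counts consecutive failures rather than a rate over time, and it has no locking, so concurrent callers would need extra care.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self._clock = clock                         # injectable for testing
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                # Fail fast: do not touch the dependency at all.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let this one call through as a trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # A failed trial reopens immediately; otherwise wait for the threshold.
        if (self.state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()

    def _on_success(self):
        # Any success (including a half-open trial) closes the circuit.
        self._failures = 0
        self.state = State.CLOSED
```

A caller would wrap each outbound call, for example `breaker.call(fetch_profile, user_id, fallback=lambda: None)`, where `fetch_profile` stands in for any cross-process call.
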
Journey Context:
Without a circuit breaker, a slow or failing dependency (a DB deadlock, a third-party API timeout) causes caller threads to block, which exhausts connection pools and propagates slowness upstream (a 'cascading failure'). This turns a partial outage into a total system outage. A circuit breaker acts as a proxy around the dependency call and monitors its failure rate.

Common mistakes: omitting the half-open state (the circuit never recovers automatically and needs manual intervention); sharing one circuit breaker across different failure modes (it should be per dependency or per operation type); opening immediately on a single timeout (it should trip on an error threshold over time).

Alternatives and complements: the Bulkhead pattern (isolates thread pools; complementary) and Retry (should only happen while the circuit is closed or half-open). Implementations: Hystrix (Java; archived, but the pattern is still valid), Resilience4j (modern), Polly (.NET), or a custom state machine.

Right call: treat a circuit breaker as mandatory for any cross-process call (HTTP, RPC, DB) in a distributed system to prevent cascading failures.
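
Two of the common mistakes above (tripping on a single timeout, and sharing one breaker across dependencies) can be avoided with a time-windowed failure rate kept separately per dependency. The sketch below shows only that tripping decision, not a full breaker. `FailureWindow`, `WindowRegistry`, and their parameters (`window_seconds`, `min_calls`, `failure_rate`) are hypothetical names chosen for illustration, and the code is not thread-safe.

```python
import time
from collections import deque


class FailureWindow:
    """Tracks call outcomes over the last `window_seconds` seconds."""

    def __init__(self, window_seconds=60.0, min_calls=10, failure_rate=0.5,
                 clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls        # ignore the rate below this volume
        self.failure_rate = failure_rate  # fraction of failures that trips
        self._clock = clock
        self._events = deque()            # (timestamp, succeeded) pairs

    def record(self, succeeded):
        self._events.append((self._clock(), succeeded))

    def _prune(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_open(self):
        self._prune()
        total = len(self._events)
        if total < self.min_calls:
            return False  # too few calls: one timeout must not trip the circuit
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.failure_rate


class WindowRegistry:
    """Hands out one FailureWindow per dependency name."""

    def __init__(self, **window_kwargs):
        self._kwargs = window_kwargs
        self._windows = {}

    def for_dependency(self, name):
        if name not in self._windows:
            self._windows[name] = FailureWindow(**self._kwargs)
        return self._windows[name]
```

With this shape, failures recorded against a name such as `"db"` never affect the window for another name such as `"payments-api"`, so one failing dependency does not cut off calls to a healthy one.
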
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:19:15.240078+00:00— report_created — created