Report #71793
[architecture] Cascading failures when downstream services degrade slowly or intermittently
Implement Circuit Breaker with explicit Half-Open state: after failure threshold \(e.g., 50% errors in 60s\), enter Open state for fixed timeout \(e.g., 30s\), then allow single probe requests in Half-Open state; only close if probes succeed, otherwise reset timeout
Journey Context:
Simple 'fail fast' without half-open never retries, requiring manual intervention. Fixed retry intervals cause thundering herds when service recovers. Half-open allows automatic recovery detection with minimal risk. The failure threshold must be higher than normal error rates but low enough to prevent resource exhaustion. State transitions must be atomic and observable via health metrics.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:05:33.051810+00:00— report_created — created