Report #77156
[architecture] Circuit breakers tripping prematurely on transient spikes or staying open too long after recovery
Calculate the failure threshold over a rolling statistical window (e.g., the last N seconds, split into buckets) rather than with absolute counters or fixed timeouts. Open the circuit only if the failure percentage exceeds its threshold AND the request volume in the window exceeds a minimum. After an exponential backoff sleep, transition to half-open and let a single probe request through.
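A minimal sketch of the rolling-window decision described above, assuming one-second buckets and a single-threaded caller. The names `RollingWindow` and `should_open`, and the defaults (10-second window, 50% failure threshold, 20-request minimum), are illustrative assumptions and are not taken from any particular library.

```python
import time
from collections import deque


class RollingWindow:
    """One-second buckets of [second, successes, failures] for the last `window_s` seconds."""

    def __init__(self, window_s=10, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested with a fake clock
        self.buckets = deque()

    def _evict(self, now_s):
        # Drop buckets that have aged out of the window.
        while self.buckets and self.buckets[0][0] <= now_s - self.window_s:
            self.buckets.popleft()

    def record(self, ok):
        now_s = int(self.clock())
        self._evict(now_s)
        if not self.buckets or self.buckets[-1][0] != now_s:
            self.buckets.append([now_s, 0, 0])
        self.buckets[-1][1 if ok else 2] += 1

    def totals(self):
        self._evict(int(self.clock()))
        successes = sum(b[1] for b in self.buckets)
        failures = sum(b[2] for b in self.buckets)
        return successes, failures


def should_open(window, failure_pct_threshold=50.0, min_requests=20):
    """Open only when there is enough traffic AND the failure percentage is too high."""
    successes, failures = window.totals()
    total = successes + failures
    if total < min_requests:
        return False  # too few requests in the window to judge
    return 100.0 * failures / total > failure_pct_threshold
```

With these assumed defaults, three failures in a quiet period never open the circuit, because the window holds fewer than 20 requests. Old failures also stop counting once their bucket ages out, so no explicit reset is needed.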
Journey Context:
Naive circuit breakers count failures since the last reset or use fixed time windows. Without a minimum volume threshold they open on harmless transient blips, and they can fail to detect sustained degradation that is masked by low traffic. Netflix Hystrix uses a rolling window of buckets and calculates the error percentage only when sufficient volume exists. The half-open state with a single probe prevents a thundering herd against a downstream that has just recovered. Exponential (rather than fixed) backoff between probes prevents aggressive retry storms against a struggling downstream. Tradeoff: the state machine and metrics tracking are significantly more complex than in simple '3 strikes' approaches.
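The open, half-open, and closed transitions can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `ProbeGate`, its method names, and the 1-second base and 60-second cap are hypothetical, and the class is not thread-safe (a real breaker would guard state changes with a lock).

```python
import time


class ProbeGate:
    """Circuit state with a single half-open probe and exponential backoff between probes."""

    def __init__(self, base_sleep_s=1.0, max_sleep_s=60.0, clock=time.monotonic):
        self.base_sleep_s = base_sleep_s
        self.max_sleep_s = max_sleep_s
        self.clock = clock  # injectable for testing
        self.state = "closed"
        self.consecutive_opens = 0
        self.retry_at = 0.0

    def trip(self):
        # Each consecutive trip doubles the sleep, up to the cap.
        sleep_s = min(self.max_sleep_s, self.base_sleep_s * 2 ** self.consecutive_opens)
        self.consecutive_opens += 1
        self.state = "open"
        self.retry_at = self.clock() + sleep_s

    def allow_request(self):
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() >= self.retry_at:
            self.state = "half_open"  # this caller becomes the single probe
            return True
        return False  # still sleeping, or a probe is already in flight

    def on_probe_result(self, ok):
        if self.state != "half_open":
            return
        if ok:
            self.state = "closed"
            self.consecutive_opens = 0  # recovery resets the backoff
        else:
            self.trip()  # failed probe: reopen with a longer sleep
```

Because only the first caller after the sleep flips the state to half-open, every other request keeps failing fast until the probe reports back. A failed probe reopens the circuit with a doubled sleep, so a downstream that is still struggling sees progressively fewer probes.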
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:06:15.053666+00:00— report_created — created