Report #80131
[architecture] Circuit breaker flapping between open and closed due to excessive traffic in half-open state
Limit half-open probes to a single request or a small fixed count (e.g., 3), never a percentage of traffic. Require consecutive successes (e.g., 3) before closing, and give probe attempts their own, shorter timeout.
Journey Context:
Developers often configure circuit breakers with trip thresholds such as 'open after 50% errors' but leave the half-open settings at their defaults or set them to 'allow 10% traffic'. When the downstream service starts to recover, 10% of a large request volume can still overwhelm it, so it fails again and trips the breaker back to open (flapping). The fix from production hardening: in the half-open state, allow exactly 1 request (or a small fixed number, such as 3) through, not a percentage. Wait for that probe to succeed, then allow another, and require N consecutive successes before fully closing. This prevents a 'thundering herd' from hitting a recovering service.
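The half-open behaviour described above can be sketched roughly as follows. This is a minimal illustration, not any particular library's API: the class name `CircuitBreaker`, its parameters (`failure_threshold`, `open_timeout`, `required_successes`, `call_timeout`, `probe_timeout`) and the injectable `clock` are all hypothetical. For brevity it trips on a count of consecutive failures rather than an error percentage, admits one probe at a time, and is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative breaker: one probe at a time while half-open (hypothetical API)."""

    def __init__(self, failure_threshold=5, open_timeout=30.0,
                 required_successes=3, call_timeout=2.0, probe_timeout=0.5,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold    # consecutive failures that trip the breaker
        self.open_timeout = open_timeout              # seconds to stay open before probing
        self.required_successes = required_successes  # consecutive probe successes needed to close
        self.call_timeout = call_timeout              # per-request timeout in normal operation
        self.probe_timeout = probe_timeout            # shorter per-request timeout for probes
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if the caller may send a request right now."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout:
                return False
            # Open period elapsed: start probing with a clean success count.
            self.state = State.HALF_OPEN
            self.successes = 0
            self.probe_in_flight = False
        if self.state is State.HALF_OPEN:
            # A fixed count of one probe, never a share of traffic.
            if self.probe_in_flight:
                return False
            self.probe_in_flight = True
            return True
        return True

    def timeout_for_next_call(self):
        """Timeout the caller should apply to the request it is about to send."""
        return self.probe_timeout if self.state is State.HALF_OPEN else self.call_timeout

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.probe_in_flight = False
            self.successes += 1
            # Close only after N consecutive successful probes.
            if self.successes >= self.required_successes:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            # Any failed probe reopens the breaker immediately.
            self._trip()
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip()

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.probe_in_flight = False
        self.failures = 0
```

A caller would check `allow_request()` before each downstream call, apply `timeout_for_next_call()` to it, and then report the outcome with `record_success()` or `record_failure()`. Because a probe is admitted only when no other probe is in flight, the recovering service sees at most one request at a time until the breaker closes, regardless of how much traffic is waiting.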
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:06:34.808002+00:00— report_created — created