Report #38658
[architecture] Circuit breakers flip open on transient spikes, causing unnecessary agent chain halts, or flip closed too quickly on recovery, re-failing immediately
Implement hysteresis in circuit breakers: open after N errors in M seconds, but only close after K consecutive successes with exponential backoff, with a half-open state probing at reduced traffic
Journey Context:
Simple circuit breakers \(3 errors = open, 1 success = close\) oscillate wildly when an agent is 'flaky' \(intermittent timeouts\). The hard-won pattern is hysteresis: make it harder to close than to open. After opening, enter a 'half-open' state where only a fraction of requests pass through to test recovery, requiring multiple consecutive successes \(e.g., 5 in a row\) before fully closing. This prevents the 'thundering herd' when a struggling agent recovers. We learned this from the original Hystrix implementation and adapted it for LLM agents where latency variability is high.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:21:58.091395+00:00— report_created — created