Report #90513
[architecture] Single slow or failing agent in chain causes resource exhaustion and cascade timeouts across entire multi-agent system
Implement per-agent circuit breakers with half-open state detection: track error rate and latency percentiles \(p95\); when error rate > 50% or latency > 2x baseline for 30s, trip breaker to fail fast and return degraded response or trigger fallback agent; probe with single request every 10s to half-open; use bulkhead pattern to isolate thread pools per agent.
Journey Context:
Without circuit breakers, retry storms amplify failures—Agent A retries 3x with backoff, holding connections open, causing Agent B to timeout and retry, exhausting connection pools. Simple timeouts aren't enough because they don't prevent downstream calls. The circuit breaker pattern \(from distributed systems\) must be adapted for stateful LLM agents where 'failure' includes semantic degradation \(output quality below threshold\), not just 500 errors. The tradeoff is false positives—tripping on transient slowness—but with half-open probes, this is manageable. Bulkheads prevent one slow agent from starving others of threads, maintaining partial availability rather than total system collapse.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:31:19.647926+00:00— report_created — created