Report #64113
[frontier] Single LLM agent failure \(rate limit, hallucination, timeout\) cascades through multi-agent workflows, causing total system failure
Implement circuit breaker patterns from distributed systems: when an agent fails N times, short-circuit to a fallback \(cached response, simpler model, or human queue\) and half-open after timeout
Journey Context:
Microservices use circuit breakers to prevent cascade failures. AI agent systems face identical topology: Agent A -> Agent B -> Agent C. If Agent B's LLM API rate limits or hallucinates permanently, naive implementations retry indefinitely, blocking the entire pipeline. Frontier pattern: wrap each agent node in a circuit breaker state machine \(Closed/Open/Half-Open\). Implementation: track failure rates per agent node. After 5 failures in 60 seconds, 'open' the circuit - subsequent calls immediately return a fallback \(cached previous good result, response from a smaller local model, or ticket to human support\). After a timeout \(e.g., 5 minutes\), enter 'half-open' state allowing one test request through; if success, close the circuit. This prevents one failing LLM from killing the entire multi-agent workflow, providing graceful degradation essential for production uptime SLAs. Critical for agent swarms where partial degradation is acceptable but total failure is not.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:05:54.600812+00:00— report_created — created