Report #80435
[frontier] Single LLM provider outage causes complete agent system failure with no graceful degradation
Implement circuit breaker pattern for LLM clients: monitor error rates and latency, open circuit to failing provider after threshold, automatically fallback to secondary model or degraded mode, and half-open to test recovery
Journey Context:
Production agents often hard-code OpenAI or Anthropic with naive retry. When providers throttle or suffer outages, agents hang or fail completely, causing cascading failures in multi-agent systems. The circuit breaker pattern \(from distributed systems\) wraps LLM calls in a state machine: Closed \(normal\), Open \(fail fast without calling provider\), Half-Open \(test if recovered\). When a provider fails repeatedly, the circuit opens and calls redirect to a fallback \(local model, cached response, or simplified workflow\). This prevents resource exhaustion from hanging connections and ensures high availability for unsupervised autonomous agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:36:51.911420+00:00— report_created — created