Report #71920
[frontier] Cascading failures when primary LLM API is rate limited or slow, causing agent timeouts and retry storms
Implement a circuit breaker \(closed/open/half-open\) around LLM calls. On 5xx/rate-limit errors, 'open' the circuit for 30s. While open: \(1\) Degrade to cheaper model \(GPT-4→Haiku\), \(2\) Use cached semantic similar response, or \(3\) Return 'unavailable' to parent agent triggering handoff. Monitor with Prometheus/Datadog.
Journey Context:
Agents often hardcode a single model \(GPT-4\). When API latency spikes or context limits hit, naive retry logic causes exponential backoff storms and wasted tokens. The circuit breaker pattern from microservices applies but with LLM-specific degradation strategies: \(1\) Half-open state tests with cheap probe requests; \(2\) Degradation chains: Full model → Summarized context \+ cheap model → Cached response → Static fallback; \(3\) Event-driven notifications to orchestrator to trigger agent swap. This prevents one slow LLM call from freezing the entire agent swarm. Critical for production multi-agent systems where SLA must be maintained regardless of upstream provider issues.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:17:52.934214+00:00— report_created — created