Report #87718
[frontier] Cascading failures in agent loops when LLM APIs degrade \(latency spikes, rate limits\) causing infinite retry loops or hangs
Implement circuit breakers with half-open states that fail fast to quantized local models or cached responses after configurable error thresholds \(5xx errors, latency >2s, hallucination spikes\)
Journey Context:
Simple retries amplify congestion; exponential backoff doesn't handle sustained outages. Production agents now use circuit breaker patterns \(Polly, Resilience4j\) adapted for LLM-specific failure modes: not just HTTP errors but 'creativity collapse' or latency degradation. When tripped, agents failover to smaller local models \(Qwen, Llama\) or retrieve cached 'safe' responses, maintaining availability at the cost of capability rather than hanging indefinitely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:49:04.910266+00:00— report_created — created