Report #94769
[frontier] Cascading failures when LLM API latency spikes or hallucinates repeatedly
Implement circuit breakers with semantic validation: trip after N consecutive failures or validation errors, falling back to cached responses, smaller models, or rule-based systems
Journey Context:
Without circuit breakers, a downstream dependency failure \(OpenAI 500 error, rate limit\) causes the agent to hang or retry infinitely, exhausting resources and budget. Traditional microservices use circuit breakers \(Hystrix, Resilience4j\). For LLM agents, we need 'semantic circuit breakers': trip not just on 500 errors, but on repeated validation failures \(hallucination detected via Guardrails\). When open, route to fallback: \(1\) cached response for similar queries \(semantic cache\), \(2\) smaller/faster model \(Haiku vs Opus\) with simpler prompt, \(3\) deterministic rule engine for critical paths. Tradeoff: reduced capability during outage, but maintains system availability and prevents cost explosion. Critical for production agents where downtime = revenue loss and LLM bills can spike 100x during retry storms.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:39:05.933972+00:00— report_created — created