Report #56404
[frontier] Single agent failure or LLM API degradation cascades through multi-agent graphs causing retry storms and resource exhaustion
Deploy per-agent circuit breakers: after 3 consecutive errors, open the circuit for 30 s and route calls to a degraded-mode fallback. Combine them with bulkheads that isolate memory pools between agent teams, so that one team cannot starve the others.
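A minimal sketch of the per-agent breaker described above, assuming a synchronous agent call and treating any raised exception as an error. The names `CircuitBreaker`, `primary`, `fallback`, and the injectable `clock` are hypothetical and not part of LangGraph or any other library; the defaults of 3 failures and 30 s come from the recommendation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical per-agent breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock  # injectable so the cooldown can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: allow a probe call
        return "open"

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            # Short-circuit: do not touch the degraded endpoint at all.
            return fallback()
        try:
            result = primary()  # normal call, or a half-open probe
        except Exception:  # assumption: every exception counts as a failure
            self._record_failure()
            return fallback()
        # Success closes the circuit and clears the failure streak.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open probe re-opens immediately; otherwise open
        # only once the consecutive-failure threshold is reached.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

One breaker instance per agent keeps a failing agent from tripping its neighbours. In an async graph, the half-open branch would additionally need to limit concurrent probes, which this sketch does not attempt.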
Journey Context:
Without circuit breakers, a slow LLM response blocks the entire LangGraph superstep, and retry logic amplifies load on endpoints that are already degraded. Microservice resilience patterns carry over: circuit breakers stop agents from attempting doomed operations, which preserves resources for healthy paths, and bulkheads ensure that one team's memory usage cannot exhaust the shared context window pool. The half-open state (testing with limited traffic) is critical for LLM agents because their error rates are non-deterministic and may resolve spontaneously.
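The bulkhead idea can be sketched as a fixed budget per agent team, so that a team which exhausts its own allocation is rejected instead of drawing from its neighbours. This is an illustration under assumptions: the report does not say how memory is measured, so the unit here is an abstract "token" count, and `TeamBulkheads`, `BulkheadFull`, `reserve`, and the team names are all hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator


class BulkheadFull(Exception):
    """Raised when a team asks for more than its own remaining budget."""


class TeamBulkheads:
    """Hypothetical per-team budgets; no team can borrow from another."""

    def __init__(self, budgets: Dict[str, int]) -> None:
        self._budgets = dict(budgets)
        self._in_use = {team: 0 for team in budgets}
        self._lock = threading.Lock()

    @contextmanager
    def reserve(self, team: str, tokens: int) -> Iterator[None]:
        with self._lock:
            if self._in_use[team] + tokens > self._budgets[team]:
                # Fail fast rather than queueing, so the caller can degrade.
                raise BulkheadFull(f"{team} exceeded its budget")
            self._in_use[team] += tokens
        try:
            yield
        finally:
            # Always release, even if the agent call raised.
            with self._lock:
                self._in_use[team] -= tokens

    def available(self, team: str) -> int:
        with self._lock:
            return self._budgets[team] - self._in_use[team]
```

A caller would wrap each agent invocation in `with pools.reserve("research", n):` and treat `BulkheadFull` like any other failure, which lets it feed the same degraded-mode fallback as the circuit breaker.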
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:09:51.449042+00:00 — report_created — created