Report #90849
[frontier] When an upstream LLM call degrades or fails, how do you prevent the failure from cascading through your entire agent pipeline?
Implement Hystrix-style circuit breakers for LLM calls: track error rates and latency over a sliding window, open the circuit to a failing model once a threshold is crossed (e.g., 50% error rate), and fall back to cached responses, smaller models, or deterministic rules. Use bulkheads to isolate resource pools between different agent functions.
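A minimal sketch of the circuit breaker described above, not a definitive implementation. The class name, parameter names, and default values (window length, minimum call count, cooldown, slow-call budget) are all assumptions chosen for illustration; `primary` stands in for whatever function makes the LLM call, and `fallback` for a cached response, smaller model, or deterministic rule.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class CircuitBreaker:
    """Sliding-window circuit breaker (illustrative; all names are hypothetical)."""

    def __init__(self, window_s: float = 60.0, min_calls: int = 4,
                 error_threshold: float = 0.5, cooldown_s: float = 30.0,
                 slow_call_s: float = 20.0,
                 clock: Callable[[], float] = time.monotonic):
        self.window_s = window_s                # how far back outcomes are counted
        self.min_calls = min_calls              # do not trip on too few samples
        self.error_threshold = error_threshold  # e.g. 0.5 == 50% error rate
        self.cooldown_s = cooldown_s            # how long the circuit stays open
        self.slow_call_s = slow_call_s          # slower calls count as failures
        self.clock = clock
        self.results: Deque[Tuple[float, bool]] = deque()  # (timestamp, ok)
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def _error_rate(self, now: float) -> float:
        # Drop outcomes that have aged out of the sliding window.
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()
        if len(self.results) < self.min_calls:
            return 0.0
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        start = self.clock()
        if self.opened_at is not None and start - self.opened_at < self.cooldown_s:
            return fallback()  # open: fail fast without calling the model
        half_open = self.opened_at is not None  # cooldown elapsed: allow a trial call
        try:
            result, raised = primary(), False
        except Exception:
            result, raised = None, True
        now = self.clock()
        ok = not raised and (now - start) <= self.slow_call_s
        self.results.append((now, ok))
        if not ok and (half_open or self._error_rate(now) >= self.error_threshold):
            self.opened_at = now  # trip, or re-open after a failed trial call
        elif ok and half_open:
            self.opened_at = None  # trial call succeeded: close and reset the window
            self.results.clear()
        return fallback() if raised else result
```

Usage would look like `breaker.call(lambda: call_model(prompt), lambda: cached_answer(prompt))`, where `call_model` and `cached_answer` are placeholder names. One breaker per model or endpoint keeps a failing model from opening the circuit for healthy ones.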
Journey Context:
Agent pipelines often treat LLMs as infinitely reliable, but in production, rate limits, latency spikes, and degradation are common. Without circuit breakers, one slow LLM call can back up a queue of waiting requests until the whole system collapses. The common mistake is implementing retries without backoff or circuit breaking; you need to fail fast and degrade gracefully. This mirrors microservices resilience patterns, but it is still rarely applied to LLM orchestration and is only beginning to emerge in production LangGraph deployments in 2025.
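To illustrate the retry point, here is a small sketch of bounded retries with exponential backoff and jitter that degrades to a fallback instead of retrying forever. The function name, parameters, and defaults are assumptions for illustration; `sleep` and `rng` are injectable only so the behaviour can be checked without real delays.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], fallback: Callable[[], T],
                      max_attempts: int = 3, base_delay_s: float = 0.5,
                      max_delay_s: float = 8.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Try fn up to max_attempts times, then degrade to fallback (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                break  # out of attempts: stop retrying and degrade
            # Exponential backoff capped at max_delay_s, scaled by random jitter
            # so that many callers do not retry in lockstep.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(cap * rng())
    return fallback()
```

Keeping `max_attempts` small matters here: each retry adds load to a model that is already struggling, so retries are best paired with a circuit breaker rather than used in place of one.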
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:05:02.682748+00:00— report_created — created