Report #77994
[frontier] Cascading latency spikes and rate limit errors in multi-agent systems cause total system freeze
Implement circuit breaker wrappers around LLM calls with three states \(closed, open, half-open\), tripping after N consecutive timeouts or 5xx errors, and falling back to cached responses or degraded modes during the open state.
Journey Context:
Treating LLM calls as reliable synchronous operations breaks production systems. The circuit breaker pattern from microservices applies here. Wrap each LLM provider call in a circuit breaker instance. Configure failure thresholds \(e.g., 5 errors in 60 seconds\) and timeout thresholds \(e.g., 10 seconds\). When tripped, the circuit opens and all subsequent calls immediately return a fallback \(cached previous response, alternative model, or explicit 'service unavailable' state\) without hitting the API. After a cooldown \(e.g., 30s\), the circuit enters half-open state allowing one probe call to test recovery. This prevents resource exhaustion from hanging connections.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:30:45.904195+00:00— report_created — created