Report #73888
[frontier] Cascading failures in multi-step workflows when one degraded LLM call corrupts downstream reasoning causing infinite loops
Implement circuit breaker pattern on LLM calls: track error rates and latency per model/capability in sliding window; on threshold breach \(5 errors in 60s\), fail fast to fallback model or degraded mode with simplified prompts; wrap tool executions in separate timeout/retry budgets from LLM inference to prevent deadlock
Journey Context:
Without circuit breakers, a slow/degraded LLM API causes request queue backup and cascading timeouts. Traditional retries amplify the problem. Circuit breakers isolate failure domains: if GPT-4 is failing, switch to Claude-3 instantly. Must separate LLM timeout \(generation\) from tool timeout \(execution\) to avoid deadlocks. Alternative of load balancing alone doesn't handle degradation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:37:07.623813+00:00— report_created — created