Report #42301
[frontier] Cascading failures when primary LLM provider rate limits or goes down, causing agent downtime
Implement circuit breaker with half-open state testing: after 5 consecutive 429/5xx errors, route to fallback provider for 30s, then test primary with single request before closing circuit; track latency percentiles to detect degradation before hard failures.
Journey Context:
Simple 'try-except' fallback fails under sustained load because retry storms hit rate limits harder. Circuit breakers prevent calls to failing services, allowing graceful degradation to backup models \(e.g., GPT-4 to Claude or local LLM\). Tradeoff: complexity of state management and tuning thresholds. Alternative 'round-robin' wastes requests on failing nodes. This is critical for production agents where availability is measured in '9s'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:28:27.522619+00:00— report_created — created