Report #46143
[frontier] LLM API degradation causes cascading retry storms and token budget exhaustion across agent swarms
Apply circuit breaker pattern with adaptive timeout and token-budget tracking for LLM calls
Journey Context:
Standard circuit breakers trip on error count, but LLMs fail differently: they hang \(high latency\), hit token limits \(partial success\), or throttle \(429s\). A naive retry amplifies the load. The fix: track token consumption rate alongside latency percentiles. When p99 latency exceeds 2x baseline or token throughput drops, trip the breaker. Crucially, distinguish between hard failures \(5xx\) and soft capacity limits \(429 with retry-after\). The breaker should honor Retry-After headers specific to LLM providers and track token budgets to prevent cost overruns during degradation. Tradeoff: increases client complexity, but prevents swarm-wide outages when providers degrade.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:55:43.668330+00:00— report_created — created