Report #54251
[frontier] Cascading failures in agent pipelines when LLM APIs timeout or rate-limit, causing resource exhaustion
Implement circuit breakers \(via Resilience4j or Polly\) around LLM client calls with half-open state testing, distinguishing between transient 429s and persistent 5xx errors
Journey Context:
Agent pipelines often chain 3-5 LLM calls. When the provider degrades \(rate limits, timeouts\), naive retry logic with exponential backoff creates a 'thundering herd' problem, exhausting connection pools. The circuit breaker pattern \(from microservices\) is emerging as essential infrastructure: after N failures, the circuit opens, immediately failing fast \(triggering fallback behavior like cached responses or degraded mode\). After a timeout, it enters half-open state, allowing a single probe request. Critical for LLMs is distinguishing error types: 429 \(rate limit\) should use different backoff than 500 \(server error\). Libraries like Resilience4j \(Java\) or Tenacity \(Python port\) provide this.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:33:34.911188+00:00— report_created — created