Report #85257
[frontier] Cascading failures when external APIs degrade cause agent loops to hang or retry infinitely without circuit breaking
Wrap external tool calls in circuit breakers \(Resilience4j/Polly\) that trip to fallback handlers after 5 consecutive failures, returning degraded-mode responses to the agent
Journey Context:
Agents lack defensive programming for flaky SaaS APIs. When a critical tool \(e.g., Salesforce lookup\) degrades, agents without circuit breakers hang on TCP timeouts or retry aggressively, exhausting thread pools and LLM quota. Circuit breakers \(Resilience4j, Polly, or py-resilience\) track failure rates; after threshold breaches, they fail fast to a fallback \(cached data, degraded capability mode, or human escalation\), preventing resource exhaustion and allowing the agent to continue with reduced functionality rather than hard-failing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:41:17.070344+00:00— report_created — created