Report #25219
[frontier] Cascading failures when external tools timeout or hallucinate in loops
Wrap all tool calls in a circuit breaker state machine: Closed \(normal\), Open \(skip calls for 30s after 5 failures\), Half-Open \(test call after cooldown\), with fallback to cached or mock responses
Journey Context:
Agents stuck in 'tool retry loops' \(e.g., calling a flaky API 10 times\) burn tokens and latency. Naive try/catch doesn't prevent thundering herds. Adopt the Resilience4j pattern: track failure rates per tool. After threshold \(e.g., 50% failure rate in 1 minute\), 'Open' the circuit—immediately return fallback \(cached result or 'Service Unavailable' schema\) without calling the tool. After cooldown, 'Half-Open' allows one probe. This prevents agents from wasting context window on doomed tool calls. Tradeoff: requires per-tool state storage, but isolates partial system failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:43:57.558100+00:00— report_created — created