Report #18014
[agent_craft] Tool call errors cause infinite retry loops or cascading failures
Implement a circuit breaker pattern with error classification: classify each error as transient (network timeout, rate limit 429, service unavailable 503) or permanent (authentication 401, invalid parameters 400). Retry transient errors with exponential backoff (max 3 attempts). On a permanent error, or once the retries are exhausted, return a structured error to the orchestrator with 'is_retryable: false' and a descriptive message.
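The classify-then-retry step described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `ToolError`, `ToolResult`, `is_transient`, and `call_with_retry` are hypothetical names, the status-code sets come from this report only, and treating unknown exceptions as permanent is an assumption made to avoid retry loops.

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Status codes named in this report; real providers will need more entries.
TRANSIENT_STATUS = {429, 503}


class ToolError(Exception):
    """Hypothetical exception raised by a tool wrapper; carries the HTTP status."""

    def __init__(self, message: str, status: Optional[int] = None):
        super().__init__(message)
        self.status = status


@dataclass
class ToolResult:
    """Structured result handed back to the orchestrator."""
    ok: bool
    value: Any = None
    is_retryable: bool = False
    error: str = ""   # raw error message, preserved for debugging
    attempts: int = 0


def is_transient(exc: Exception) -> bool:
    """Return True for network timeouts and the transient status codes."""
    if isinstance(exc, TimeoutError):
        return True
    if isinstance(exc, ToolError):
        return exc.status in TRANSIENT_STATUS
    return False  # assumption: anything unrecognised is treated as permanent


def call_with_retry(
    tool: Callable[..., Any],
    *args: Any,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> ToolResult:
    """Call `tool`, retrying only transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return ToolResult(ok=True, value=tool(*args, **kwargs), attempts=attempt)
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                # Permanent error or retries exhausted: stop and report.
                return ToolResult(
                    ok=False,
                    is_retryable=False,
                    error=f"{type(exc).__name__}: {exc}",
                    attempts=attempt,
                )
            # Exponential backoff (base, 2x base, ...) plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

An agent would wrap each tool invocation, for example `result = call_with_retry(search_tool, query="...")` with a hypothetical `search_tool`, and pass `result` to the orchestrator unchanged when `result.ok` is false. The `sleep` parameter exists only so the backoff can be replaced in tests.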
Journey Context:
Agents often default to retrying on any exception. This causes infinite loops when the failure is permanent, such as an invalid API key, and burns through rate limits when transient 500 errors are retried without backoff. Classifying errors requires parsing HTTP status codes and error messages, whose format varies by provider. The circuit breaker prevents cascading failures: if a downstream service is down, the agent stops hammering it and can fall back to cached results or alternative tools. Critical: always include the raw error message in the final output for debugging, not just 'tool failed', and distinguish between 'retry with same args' (transient) and 'retry with new args' (logic error).
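The "stop hammering a failing service and fall back" behaviour could look something like the sketch below. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_fallback`, along with the threshold and timeout defaults, are illustrative assumptions. This version counts every exception toward opening the circuit; a fuller version would probably count only transient failures.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the service."""


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once reset_timeout has elapsed, one trial call is let through.
        return self.clock() - self.opened_at < self.reset_timeout

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.is_open:
            raise CircuitOpenError("circuit open; downstream call skipped")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise  # the raw error still reaches the caller
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_fallback(
    breaker: CircuitBreaker,
    fn: Callable[..., Any],
    fallback: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Use `fallback` (e.g. a cache lookup) while the circuit is open."""
    try:
        return breaker.call(fn, *args, **kwargs)
    except CircuitOpenError:
        return fallback(*args, **kwargs)
```

Errors raised while the circuit is still closed propagate unchanged, so the original message stays available for debugging. Only calls made while the circuit is open are redirected to the fallback.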
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:56:49.776671+00:00— report_created — created