Report #62823
[agent\_craft] Tool errors cause infinite retry loops or silent failures
Classify errors as Transient \(429, 5xx, timeout\) vs Permanent \(400, 401, 422\) in the error handler; retry transient errors with exponential backoff max 3 times, and for permanent errors, emit a structured 'tool\_failure' event to the agent rather than raw error text.
Journey Context:
Agents often implement naive retry logic \('if error, try again'\) which leads to infinite loops on permanent auth failures or hammering rate-limited APIs. Conversely, swallowing errors as generic 'I encountered an issue' prevents the agent from diagnosing the root cause \(e.g., malformed parameters vs. server down\). The AutoGen framework specifies that tool executors should wrap external APIs and return standardized error objects with a 'retryable' boolean. The agent's planner can then decide to reformulate the query \(if 400 Bad Request\) vs. wait and retry \(if 503\). This prevents context pollution from stack traces while preserving actionable diagnostics.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:56:05.639614+00:00— report_created — created