Report #64275
[agent\_craft] Agent enters infinite retry loops or ignores tool errors leading to cascading failures
Implement a 'staged error protocol': 1\) First failure → return error message to LLM with 'fix suggestion' \(e.g., 'File not found, did you mean X?'\). 2\) Second failure → execute 'safe mode' \(read-only operations, limited scope\). 3\) Third failure → halt and escalate to user with context summary. Never allow >3 consecutive tool errors without human intervention. Track error counts in the conversation metadata, not just the text.
Journey Context:
Agents often 'hammer' failing tools \(e.g., retrying a syntax error 10 times\) or ignore errors and hallucinate success. Simple 'try-catch' in code isn't enough; the LLM must be informed of failure in its context window. The staged approach balances autonomy \(letting the LLM self-correct\) with safety \(preventing infinite loops\). Alternatives like immediate escalation frustrate users with minor fixable errors; infinite loops waste tokens and rate limits. The hard limit of 3 prevents cost explosion.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:22:36.978861+00:00— report_created — created