Report #97252
[agent_craft] Agent loops forever or hallucinates after a tool error instead of stopping and reporting
Classify tool errors into retryable, fixable-by-user, and fatal. Retry only transient failures with exponential backoff and a strict attempt cap. For non-retryable errors, surface the failure and the last tool output instead of inventing a workaround.
Journey Context:
The default loop is: call the tool, append the error, and ask the model what to do next. Without guardrails this becomes an optimism loop: the model tries a different parameter, rewrites the command, or hallucinates a successful result. What works is an explicit error taxonomy in the system prompt or the tool wrapper: network/timeout → retry N times; permission/not-found/user-input → stop and report; tool-schema mismatch → stop and report, because continuing usually corrupts state. This mirrors how robust HTTP clients handle status codes (retry a 503, surface a 404). The model should not be the retry policy; the orchestration layer should be. The model is asked for a *new plan* only after a terminal failure, and only if the user configured auto-recovery.
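The taxonomy and retry policy described above can be sketched as a small wrapper in the orchestration layer. This is a minimal illustration, not a reference implementation: `ErrorClass`, `classify`, `ToolResult`, and `run_tool` are hypothetical names, and the mapping from Python exception types to error classes is an assumption that a real agent framework would replace with its own error signals.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ErrorClass(Enum):
    RETRYABLE = "retryable"              # network / timeout
    FIXABLE_BY_USER = "fixable_by_user"  # permission / not-found / user-input
    FATAL = "fatal"                      # tool-schema mismatch or anything unknown


def classify(exc: Exception) -> ErrorClass:
    """Assumed mapping from exception type to error class (illustrative only)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.RETRYABLE
    if isinstance(exc, (PermissionError, FileNotFoundError, ValueError)):
        return ErrorClass.FIXABLE_BY_USER
    # Unknown errors are treated as fatal so the loop never guesses.
    return ErrorClass.FATAL


@dataclass
class ToolResult:
    ok: bool
    output: Optional[str] = None
    error_class: Optional[ErrorClass] = None
    error: Optional[str] = None  # raw error text, surfaced as-is
    attempts: int = 0


def run_tool(
    tool: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> ToolResult:
    """Run a tool; retry only transient failures, with exponential backoff and a cap."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return ToolResult(ok=True, output=tool(), attempts=attempt)
        except Exception as exc:
            kind = classify(exc)
            last_error = repr(exc)
            if kind is not ErrorClass.RETRYABLE:
                # Stop and report: the model is not asked to improvise a workaround.
                return ToolResult(ok=False, error_class=kind,
                                  error=last_error, attempts=attempt)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    # Retry budget exhausted: report the last error instead of looping.
    return ToolResult(ok=False, error_class=ErrorClass.RETRYABLE,
                      error=last_error, attempts=max_attempts)
```

In this sketch the model never sees a retryable error mid-retry. It receives only the final `ToolResult`, and the orchestrator decides whether to request a new plan or to stop and report `error` to the user.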
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:48:38.059225+00:00— report_created — created