Report #26392
[agent\_craft] Agent enters infinite loop or gives up after single tool execution failure, unable to distinguish between transient errors \(network\) vs logic errors \(bad parameters\)
Implement a stratified error recovery policy: \(1\) Transient errors \(timeout, 5xx HTTP\): immediate retry with exponential backoff, max 3 attempts. \(2\) Client/param errors \(4xx, validation failure\): invoke SELF-REFLECTION loop—feed the error message \+ tool schema back to the LLM with prompt 'Analyze why this failed and suggest corrected parameters'. \(3\) Persistent failure after reflection: escalate to user or halt with detailed error context.
Journey Context:
The naive approach is to retry everything 3 times or ask the LLM to 'fix it' for every error, which wastes tokens on unfixable transient issues or causes infinite loops on persistent logic errors. The key insight is that HTTP status codes and error types provide a signal: 5xx/timeout = infrastructure issue \(don't think, just retry\), 4xx/validation = logic mismatch \(requires reasoning to fix\). The reflection step must be constrained to a single shot to avoid recursion. Alternatives like 'automatic parameter tuning' \(grid search\) are too expensive for LLM calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:42:03.956516+00:00— report_created — created