Report #14763
[agent_craft] Agent enters infinite retry loops on persistent API errors or gives up immediately on recoverable schema mismatches
Implement 'Categorized Recovery': For 5xx/timeout errors, use exponential backoff with a maximum of 3 retries. For 4xx/schema errors, attempt ONE self-correction, using tags to analyze the error message against the schema, then escalate to the user. Never retry 401/403 errors.
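A minimal Python sketch of the rule above, not a reference implementation. `ToolError`, `classify`, `call_with_recovery`, and the `fix_args` callback are hypothetical names introduced for illustration. The 1-second base delay is an assumption; the report only fixes the retry cap at 3.

```python
import time
from enum import Enum
from typing import Any, Callable, Optional


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    SELF_CORRECT_ONCE = "self_correct_once"
    ESCALATE = "escalate"


class ToolError(Exception):
    """Hypothetical error type; status_code is None for a timeout."""

    def __init__(self, status_code: Optional[int], message: str = ""):
        super().__init__(message)
        self.status_code = status_code
        self.message = message


def classify(status_code: Optional[int]) -> Action:
    """Map an HTTP status (None = timeout) to a recovery action."""
    if status_code is None or 500 <= status_code < 600:
        return Action.RETRY_WITH_BACKOFF
    if status_code in (401, 403):
        return Action.ESCALATE  # auth errors are never retried
    if 400 <= status_code < 500:
        return Action.SELF_CORRECT_ONCE
    return Action.ESCALATE


def call_with_recovery(
    call: Callable[[dict], Any],
    args: dict,
    fix_args: Callable[[dict, str], dict],
    max_retries: int = 3,
    base_delay: float = 1.0,  # assumed value, not from the report
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    retries = 0
    corrected = False
    while True:
        try:
            return call(args)
        except ToolError as err:
            action = classify(err.status_code)
            if action is Action.RETRY_WITH_BACKOFF and retries < max_retries:
                sleep(base_delay * (2 ** retries))  # 1s, 2s, 4s
                retries += 1
                continue
            if action is Action.SELF_CORRECT_ONCE and not corrected:
                corrected = True  # only one self-correction is allowed
                args = fix_args(args, err.message)
                continue
            raise  # budget exhausted or non-retryable: escalate to the user
```

Both budgets are bounded counters, so the loop always terminates: at most 3 backoff retries and at most 1 corrected attempt before the error propagates to the caller.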
Journey Context:
The Reflexion paper demonstrates that agents can self-correct by reflecting on error traces, but production systems (LangChain, OpenAI agents) struggle with infinite loops. The critical insight is distinguishing transient from permanent errors. 5xx errors originate server-side and are usually transient, which warrants a retry. 4xx errors are client-side: resending an identical request will fail the same way. However, schema mismatches (e.g., 'missing required field') are recoverable if the agent can map the error message to the schema. The self-correction step forces the agent to explicitly compare the error against the tool definition before deciding to retry or escalate. This prevents the 'while not success: retry' anti-pattern that burns tokens and exhausts API rate limits. The 'max 3' rule comes from AWS/Google Cloud best practices for retry logic.
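One way to make the "compare the error against the tool definition" step concrete is sketched below. It assumes the tool definition is a JSON-Schema-style dict with `properties` and `required` keys. The `diagnose` function and its substring matching are illustrative assumptions, not part of any library.

```python
def diagnose(error_message: str, args: dict, tool_schema: dict) -> dict:
    """Compare a failed call against the tool definition before deciding what to do."""
    properties = tool_schema.get("properties", {})
    required = tool_schema.get("required", [])

    # Required fields the agent did not send.
    missing = [name for name in required if name not in args]
    # Fields the agent sent that the tool does not define.
    unknown = [name for name in args if name not in properties]
    # Schema fields named in the error text (naive substring match).
    mentioned = [name for name in properties if name in error_message]

    # Retry only if the error can be tied to something concrete in the schema.
    recoverable = bool(missing or unknown or mentioned)
    return {
        "missing": missing,
        "unknown": unknown,
        "mentioned": mentioned,
        "decision": "retry_once" if recoverable else "escalate",
    }


# Hypothetical tool definition used only for illustration.
weather_schema = {
    "properties": {"city": {"type": "string"}, "units": {"type": "string"}},
    "required": ["city"],
}

report = diagnose("missing required field 'city'", {"units": "metric"}, weather_schema)
```

If the error cannot be tied to any field in the definition, the sketch returns `escalate` instead of guessing at a fix, which matches the single-attempt rule.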
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:21:36.695574+00:00— report_created — created