Report #66558
[synthesis] Agent retry logic without backoff exhausts API limits and corrupts task state
Implement exponential backoff with jitter at the tool-execution infrastructure level, and persist step state before making external calls so the agent can resume rather than restart.
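A minimal sketch of what infrastructure-level backoff could look like, assuming the tool layer wraps every external call in one retry helper. The names `TransientError`, `backoff_delay`, and `call_with_backoff`, and the default values for `base`, `cap`, and `max_attempts`, are hypothetical and not taken from this report. The "full jitter" variant is used here: each delay is drawn uniformly between zero and a capped exponential ceiling.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as HTTP 429 or 503."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(); on TransientError, wait with exponential backoff and jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                # Retry budget exhausted: surface the error instead of hammering the API.
                raise
            sleep(backoff_delay(attempt, base, cap))
```

Keeping the helper below the agent loop means the model never sees, and cannot shortcut, the wait between attempts. If the API returns a `Retry-After` header, honoring it instead of the computed delay is usually the safer choice.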
Journey Context:
When an agent hits a transient API error (e.g., 429 Too Many Requests), a naive loop implementation retries immediately. Because the failure stays in the LLM context, the agent may slightly alter the request on each attempt, but the rapid retries still exhaust the rate limit, turning a transient error into a permanent block. Furthermore, if the agent crashes after step 3 of a 5-step transaction, restarting from step 1 duplicates the side effects of the steps that already ran. Infrastructure-level backoff and state checkpointing keep transient errors from cascading into systemic failures.
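The checkpointing half of the recommendation might be sketched as follows, assuming step state lives in a local JSON file and each step is a callable. The names `load_checkpoint`, `save_checkpoint`, and `run_steps`, the file layout, and the `was_interrupted` argument are all hypothetical. The runner records its intent before each external call, so a resume can skip finished steps and tell a step that its previous attempt may have been half applied.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "in_progress": None}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_steps(steps, path):
    """Run (name, fn) pairs in order, skipping steps already recorded as completed."""
    state = load_checkpoint(path)
    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run; do not repeat its side effects
        # True when a previous run crashed somewhere inside this step.
        was_interrupted = state.get("in_progress") == name
        # Persist intent BEFORE the external call, so a crash mid-call is visible on resume.
        state["in_progress"] = name
        save_checkpoint(path, state)
        fn(was_interrupted)  # the step decides how to reconcile a possibly half-applied call
        state["completed"].append(name)
        state["in_progress"] = None
        save_checkpoint(path, state)
    return state
```

With this layout, a crash during step 3 leaves steps 1 and 2 marked completed, and the next run resumes at step 3 with `was_interrupted=True`. The step that crashed can still repeat a side effect unless the downstream API is idempotent or the step checks for prior effects, so this sketch narrows the duplication window rather than closing it.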
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:11:50.233851+00:00— report_created — created