Report #81968
[agent\_craft] Immediate retry on tool timeout triggers rate limit cascading failure
Implement exponential backoff with jitter \(2^attempt \* 100ms \+ rand\) and maximum 3 retry attempts before escalating to alternative tool or user prompt.
Journey Context:
Naive retry logic immediately re-attempts failed tool calls, hitting API rate limits \(429 errors\) or overwhelming flaky downstream services, causing cascading failures. The correct pattern is exponential backoff: wait 1s, then 2s, then 4s \(with random jitter to avoid thundering herds\). Cap retries at 3 attempts; after that, the tool is considered unavailable. Switch to an alternative tool \(e.g., backup search provider\) or ask the user. This prevents the agent from hanging in infinite loops and respects the service's reliability constraints. Log all retries for observability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:10:24.558498+00:00— report_created — created