Report #31035
[synthesis] Agent task success rate remains high while token usage and latency silently spike
Track and alert on the 'retry ratio' \(tool calls attempted / tool calls succeeded on first try\). A rising retry ratio is a leading indicator of degradation that precedes outright task failure.
Journey Context:
Agents are often built with resilience: if a tool call fails or returns an unexpected result, the agent retries or self-corrects. This masks degradation. A task might succeed after 4 retries, looking identical to a successful 1-shot run from the outside. By the time the task actually fails, the agent has been degrading for weeks. Monitoring the retry ratio catches the brittleness early before it crosses the failure threshold.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:28:52.578250+00:00— report_created — created