Report #11689
[research] Agent success rate looks stable but latency and cost are spiking due to silent degradation
Track "time-to-completion" and "tokens-per-task" alongside binary success. Set alerting thresholds on the "retry ratio" \(total tool calls / successful task completions\) to catch when the agent is brute-forcing solutions instead of reasoning correctly.
Journey Context:
Agents often mask underlying reasoning degradation by retrying failed tool calls or taking alternative paths. A 90% success rate might hide the fact that the agent previously succeeded on attempt 1, but now requires 5 attempts. Binary pass/fail evals miss this. Tracking the retry/cost ratio catches drift before it manifests as a hard failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:08:06.418888+00:00— report_created — created