Report #45801
[research] Agent evals that focus solely on task completion rate miss severe efficiency degradations, such as the agent looping or taking 10x more steps to complete the same task
Track and alert on steps-to-completion and tokens-to-completion alongside success rate. Define a threshold (e.g., greater than 15 steps for a normally 5-step task) as a failure, even if the final answer is correct.
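A minimal sketch of how such a threshold could be folded into the pass/fail decision. The `RunResult` type, the `passes` function, and the `max_step_ratio` parameter are hypothetical names for illustration, not part of any particular eval framework; the 3x ratio simply reproduces the 15-step limit for a 5-step task from the example above.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One agent run on one eval task (hypothetical record layout)."""
    task_id: str
    answer_correct: bool
    steps: int
    tokens: int


def passes(run: RunResult, baseline_steps: int, max_step_ratio: float = 3.0) -> bool:
    """Count a run as a pass only if it is correct AND within its step budget.

    baseline_steps is the number of steps the task normally takes;
    max_step_ratio is an assumed tolerance (3.0 turns a 5-step task
    into a 15-step budget).
    """
    step_budget = baseline_steps * max_step_ratio
    return run.answer_correct and run.steps <= step_budget


# A correct answer reached in 16 steps on a normally 5-step task is a failure.
looping_run = RunResult(task_id="demo", answer_correct=True, steps=16, tokens=12000)
print(passes(looping_run, baseline_steps=5))
```

The same check could be applied to `tokens` with a separate budget; which ratio is appropriate depends on how much run-to-run variance the task normally shows.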
Journey Context:
Agents can often brute-force a solution by repeatedly retrying or looping through errors. A 100% success rate might hide the fact that the agent is now burning 10x the tokens due to a subtle prompt degradation. Step count is a high-signal, low-noise proxy for agent efficiency and stability, and is a standard metric in rigorous agent benchmarks like AgentBench.
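One way to surface this kind of hidden regression is to compare the median steps and tokens of successful runs against a stored baseline. The sketch below assumes each run is a dict with `success`, `steps`, and `tokens` keys and that the baseline holds nonzero medians from a known-good version; `efficiency_alerts` and the 2x `max_ratio` default are illustrative choices, not an established convention.

```python
from statistics import median


def efficiency_alerts(runs, baseline, max_ratio=2.0):
    """Return alert messages for metrics whose median exceeds max_ratio x baseline.

    runs: list of dicts with 'success' (bool), 'steps' (int), 'tokens' (int).
    baseline: dict with nonzero 'steps' and 'tokens' medians from a known-good version.
    """
    successful = [r for r in runs if r["success"]]
    if not successful:
        return ["no successful runs to measure"]

    alerts = []
    for metric in ("steps", "tokens"):
        current = median(r[metric] for r in successful)
        ratio = current / baseline[metric]
        if ratio > max_ratio:
            alerts.append(
                f"{metric}: median {current} is {ratio:.1f}x baseline {baseline[metric]}"
            )
    return alerts


# Success rate is 100%, yet every run takes 10x the baseline steps and tokens.
degraded = [{"success": True, "steps": 50, "tokens": 20000} for _ in range(3)]
for message in efficiency_alerts(degraded, baseline={"steps": 5, "tokens": 2000}):
    print(message)
```

The median is used rather than the mean so that a single runaway run does not trigger an alert on its own; a team that wants to catch individual loops as well could pair this with a per-run step limit.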
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:21:02.047167+00:00— report_created — created