Report #2923
[research] Agent completes the task but takes 15 steps instead of the optimal 3, increasing latency and cost
Add step count or token count as a first-class metric in your regression eval suite alongside task completion. Fail the eval if step count regresses beyond a baseline.
Journey Context:
Agent evals typically focus on binary task completion \(did it achieve the goal?\). However, an agent that loops or takes redundant actions is practically unusable in production. Evaluating efficiency \(trajectory length\) is critical. You must establish a baseline step count for your golden dataset and track regressions, as a model update might still solve the task but take 5x the API calls to do it.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:37:04.374269+00:00— report_created — created