Report #24288
[research] Agent evals only measure accuracy, missing cost and latency regressions that signal deeper problems
Include cost (token usage) and latency (wall-clock time per step and total) as first-class eval dimensions. Set explicit budgets per task type (e.g., ≤5 steps, ≤10k tokens, ≤30s). Flag any run exceeding budget even if the final answer is correct.
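A minimal sketch of such a budget check, assuming each run has already been reduced to a step count, a token count, and a wall-clock duration. The names `Budget`, `RunMetrics`, `budget_violations`, and `evaluate` are hypothetical and not taken from any particular eval framework; the limits mirror the example figures above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Per-task-type limits (hypothetical structure)."""
    max_steps: int
    max_tokens: int
    max_seconds: float


@dataclass(frozen=True)
class RunMetrics:
    """Measurements collected for one agent run (hypothetical structure)."""
    steps: int
    tokens: int
    seconds: float
    correct: bool


def budget_violations(run: RunMetrics, budget: Budget) -> list[str]:
    """Return the name of every budget dimension the run exceeded."""
    violations = []
    if run.steps > budget.max_steps:
        violations.append("steps")
    if run.tokens > budget.max_tokens:
        violations.append("tokens")
    if run.seconds > budget.max_seconds:
        violations.append("seconds")
    return violations


def evaluate(run: RunMetrics, budget: Budget) -> dict:
    """Score a run on accuracy and budget; a correct but over-budget run fails."""
    violations = budget_violations(run, budget)
    return {
        "correct": run.correct,
        "within_budget": not violations,
        "violations": violations,
        "passed": run.correct and not violations,
    }


# Example limits from the recommendation: <=5 steps, <=10k tokens, <=30s.
budget = Budget(max_steps=5, max_tokens=10_000, max_seconds=30.0)
result = evaluate(RunMetrics(steps=15, tokens=8_000, seconds=12.0, correct=True), budget)
```

Here the run is correct but used 15 steps, so `result["passed"]` is false and `result["violations"]` lists only `"steps"`. Keeping `correct` and `within_budget` as separate fields lets a report distinguish wrong answers from inefficient ones.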
Journey Context:
An agent that reaches the same outcome in 15 steps instead of 5 has regressed, even if the final answer is correct. Cost and latency are proxy metrics for agent efficiency and reliability: agents that take more steps are more likely to encounter errors, hit rate limits, or accumulate context-window confusion. A sudden step-count increase often precedes accuracy degradation. LangSmith tracks token usage per step; Langfuse tracks latency and cost per trace. Make these budget constraints explicit in your eval suite, because they are early-warning signals, not just operational concerns.
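One way to catch a sudden step-count increase is to compare the current eval run against a stored baseline. The sketch below is an assumption-laden illustration: `step_count_regressed` is a hypothetical helper, and the 1.5x tolerance is an arbitrary placeholder that would need tuning per task type. It uses the median so that a single outlier run does not trigger the alarm.

```python
from statistics import median


def step_count_regressed(baseline_steps, current_steps, tolerance=1.5):
    """Return True when the median step count grew by more than `tolerance` times.

    `baseline_steps` and `current_steps` are lists of per-run step counts for
    the same task type. The default tolerance is an assumption, not a
    recommended value.
    """
    if not baseline_steps or not current_steps:
        raise ValueError("both samples must be non-empty")
    return median(current_steps) > tolerance * median(baseline_steps)


# Same task, same final answers, but the agent now takes about three times as many steps.
regressed = step_count_regressed([5, 5, 6], [15, 14, 16])
```

The same comparison can be applied to token counts or latencies by passing those series instead of step counts.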
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:10:29.669344+00:00 - report_created - created