Report #96657
[research] Agent completes the task successfully but takes a highly suboptimal, convoluted path that increases latency and cost
Run an asynchronous LLM-as-a-judge eval on the trace (the sequence of steps), not just the final outcome, scoring the agent on 'path efficiency' or 'relevance of tool calls'.
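A minimal sketch of such a trace-level judge, assuming the trace is available as an ordered list of tool calls. The names `Step`, `build_judge_prompt`, `judge_trace`, the rubric wording, and the 1-5 scale are all hypothetical and not from this report. The judge model is passed in as an async callable, so no particular LLM client is assumed.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Step:
    """One tool call in the agent trace (hypothetical shape)."""
    tool: str
    args: dict


# Any async function that takes a prompt and returns the judge's raw reply.
JudgeFn = Callable[[str], Awaitable[str]]

# Hypothetical rubric; adjust the criteria and scale to your own eval.
RUBRIC = (
    "You are grading an agent's trace, not its final answer.\n"
    "Score 1-5 for path_efficiency (few redundant steps) and "
    "tool_relevance (each call was needed for the task).\n"
    'Reply with JSON only: {"path_efficiency": n, "tool_relevance": n}'
)


def build_judge_prompt(task: str, steps: list[Step]) -> str:
    """Render the task and the numbered steps into a single judge prompt."""
    lines = [
        f"{i}. {s.tool}({json.dumps(s.args, sort_keys=True)})"
        for i, s in enumerate(steps, 1)
    ]
    return f"{RUBRIC}\n\nTask: {task}\n\nTrace:\n" + "\n".join(lines)


async def judge_trace(task: str, steps: list[Step], judge: JudgeFn) -> dict:
    """Ask the judge to score the trace and parse its JSON reply."""
    raw = await judge(build_judge_prompt(task, steps))
    scores = json.loads(raw)
    return {
        "path_efficiency": int(scores["path_efficiency"]),
        "tool_relevance": int(scores["tool_relevance"]),
    }


async def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the code."""
    return '{"path_efficiency": 2, "tool_relevance": 3}'
```

Because `judge_trace` is a coroutine, it can be scheduled after the agent has already responded (for example with `asyncio.create_task`), which keeps the eval off the request path. Swap `fake_judge` for a real model call when wiring this up.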
Journey Context:
Outcome-based evals (did the agent get the right answer?) miss degradation in the agent's reasoning. An agent might read 10 files instead of 1, or search 3 times instead of once. If you only evaluate the final output, this cost and latency creep goes unnoticed until spend or response time becomes a problem. Trace-level judge evals catch the degradation before it impacts the outcome.
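The kind of creep described above (10 file reads instead of 1, 3 searches instead of once) can also be counted directly from the trace before any judge model is involved. The sketch below is an assumption layered on top of this report, not part of it: `path_stats`, the dict-shaped trace, and the `baseline_steps` parameter are hypothetical, and the counts are a cheap pre-filter rather than a replacement for the judge.

```python
import json
from collections import Counter


def path_stats(trace: list[dict], baseline_steps: int) -> dict:
    """Summarize a trace of {"tool": ..., "args": ...} steps.

    baseline_steps is an assumed reference step count for the task,
    e.g. the median from earlier known-good runs.
    """
    # Two calls count as the same if tool name and arguments match exactly.
    keys = [(s["tool"], json.dumps(s["args"], sort_keys=True)) for s in trace]
    repeated = sum(n - 1 for n in Counter(keys).values())
    return {
        "steps": len(trace),
        "repeated_calls": repeated,
        "step_ratio": len(trace) / baseline_steps,
        "calls_per_tool": dict(Counter(s["tool"] for s in trace)),
    }


# Example trace: two identical searches, one different search, two file reads.
example_trace = [
    {"tool": "search", "args": {"q": "retry logic"}},
    {"tool": "search", "args": {"q": "retry logic"}},
    {"tool": "search", "args": {"q": "backoff"}},
    {"tool": "read_file", "args": {"path": "client.py"}},
    {"tool": "read_file", "args": {"path": "util.py"}},
]
stats = path_stats(example_trace, baseline_steps=2)
```

Tracking `step_ratio` and `repeated_calls` over time gives an early signal that runs are getting longer even while the final answers stay correct. Traces that cross a threshold can then be sent to the judge for a closer look.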
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:49:36.047583+00:00— report_created — created