Report #13860
[research] Agent passes outcome evals but uses suboptimal, expensive trajectories
Implement trajectory evals alongside outcome evals. Score traces on metrics such as tool call efficiency (steps taken), loop detection (repeated identical actions), and context window utilization.
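Below is a minimal sketch of a scorer for those three metrics. It assumes a hypothetical trace format in which each step records the tool name, its arguments, and the prompt-token count at that step. `ToolCall`, `score_trajectory`, `reference_steps`, and `context_limit` are illustrative names, not part of any specific eval framework. Scoring efficiency as a ratio against a reference step count is one possible definition among several.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical schema)."""
    tool: str
    args: tuple          # hashable, so identical calls compare equal
    prompt_tokens: int   # tokens in the context window when this call was made


def score_trajectory(trace, reference_steps, context_limit):
    """Score a trace on tool call efficiency, loops, and context usage."""
    steps = len(trace)
    # Tool call efficiency: 1.0 when the agent needs no more steps than the
    # reference path, falling toward 0 as extra steps accumulate.
    efficiency = min(1.0, reference_steps / steps) if steps else 0.0
    # Loop detection: count every call that exactly repeats an earlier
    # (tool, args) pair.
    counts = Counter((c.tool, c.args) for c in trace)
    repeated = sum(n - 1 for n in counts.values())
    # Context window utilization: peak fraction of the window consumed.
    peak = max((c.prompt_tokens for c in trace), default=0)
    return {
        "steps": steps,
        "efficiency": efficiency,
        "repeated_actions": repeated,
        "context_utilization": peak / context_limit,
    }


# Usage: the agent repeats the same search three times before reading the doc.
trace = [
    ToolCall("search", ("q=refund policy",), 1200),
    ToolCall("search", ("q=refund policy",), 2400),
    ToolCall("search", ("q=refund policy",), 3600),
    ToolCall("read_doc", ("id=42",), 5200),
]
print(score_trajectory(trace, reference_steps=2, context_limit=8000))
# → {'steps': 4, 'efficiency': 0.5, 'repeated_actions': 2, 'context_utilization': 0.65}
```

Counting repeats across the whole trace, not only consecutive ones, also catches loops where the agent alternates between two tools before retrying the same call.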
Journey Context:
Outcome evals (did the agent get the right final answer?) are necessary but insufficient. An agent might loop 5 times, burning tokens, before stumbling on the answer, and an outcome eval would still score that run as a pass. Without trace-level observability and evals on the path, you cannot catch silent cost degradation or latency regressions. Trajectory evals check that the agent solves problems efficiently, not just effectively.
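Silent cost regressions like this surface only when trajectory metrics are compared across runs. Below is a minimal sketch of a CI-style gate. It assumes each eval run is summarized per task with an outcome flag, a tool-call count, and a total token count. `TraceSummary`, `gate`, and the 20% `max_regression` default are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Per-task result of one eval run (hypothetical schema)."""
    task_id: str
    correct: bool        # outcome eval: did the final answer match?
    tool_calls: int      # trajectory metric: steps taken
    total_tokens: int    # trajectory metric: tokens burned across the trace


def _avg(values):
    values = list(values)
    return sum(values) / len(values)


def gate(baseline, candidate, max_regression=0.2):
    """Return failure reasons; an empty list means the candidate passes.

    Outcome accuracy must not drop, and mean steps and tokens may not grow
    by more than `max_regression` (a fraction) relative to the baseline.
    """
    failures = []
    base_acc = _avg(t.correct for t in baseline)
    cand_acc = _avg(t.correct for t in candidate)
    if cand_acc < base_acc:
        failures.append(f"accuracy dropped: {base_acc:.2f} -> {cand_acc:.2f}")
    for metric in ("tool_calls", "total_tokens"):
        base = _avg(getattr(t, metric) for t in baseline)
        cand = _avg(getattr(t, metric) for t in candidate)
        if cand > base * (1 + max_regression):
            failures.append(f"{metric} regressed: {base:.1f} -> {cand:.1f}")
    return failures


# Usage: the candidate answers every task correctly, but task t1 loops.
baseline = [TraceSummary("t1", True, 2, 3000), TraceSummary("t2", True, 3, 4000)]
candidate = [TraceSummary("t1", True, 7, 11000), TraceSummary("t2", True, 3, 4000)]
print(gate(baseline, candidate))
# → ['tool_calls regressed: 2.5 -> 5.0', 'total_tokens regressed: 3500.0 -> 7500.0']
```

An outcome-only check would pass this candidate because accuracy is unchanged. The trajectory comparison is what flags it.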
Lifecycle
2026-06-16T20:07:13.963243+00:00— report_created — created