Report #79442
[research] End-to-end evals for agents check only the final answer, so they miss cases where the agent took a highly suboptimal path (e.g., 10 steps instead of 2) to get there
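Implement 'path efficiency' evals: calculate the ratio of actual_tool_calls to optimal_tool_calls (with optimal_tool_calls defined per task in the eval golden dataset). A ratio of 1.0 means the agent matched the optimal path; higher values mean wasted calls. Score the agent on this efficiency metric alongside the final outcome.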
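A minimal sketch of how such a check might look in Python. The class names, field names, and the `max_ratio` threshold are all hypothetical and not part of this report; the only things taken from the report are the actual_tool_calls / optimal_tool_calls ratio and the idea of scoring it next to the final outcome. The sketch also assumes every golden task needs at least one tool call.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One golden-dataset entry (field names are hypothetical)."""
    expected_answer: str
    optimal_tool_calls: int  # fewest tool calls known to solve the task


@dataclass
class AgentRun:
    """What the harness recorded for one agent run (hypothetical shape)."""
    final_answer: str
    actual_tool_calls: int


def path_ratio(run: AgentRun, case: EvalCase) -> float:
    """Return actual_tool_calls / optimal_tool_calls; 1.0 is optimal."""
    if case.optimal_tool_calls <= 0:
        raise ValueError("optimal_tool_calls must be positive")
    return run.actual_tool_calls / case.optimal_tool_calls


def score(run: AgentRun, case: EvalCase, max_ratio: float = 2.0) -> dict:
    """Report outcome and path efficiency side by side.

    max_ratio is an assumed threshold, not something the report specifies.
    """
    ratio = path_ratio(run, case)
    correct = run.final_answer == case.expected_answer
    return {
        "correct": correct,
        "path_ratio": ratio,
        # A run passes only if it is both correct and efficient enough.
        "passed": correct and ratio <= max_ratio,
    }
```

With these assumptions, a run that reaches the right answer in 10 calls on a task whose golden entry lists 2 gets a path ratio of 5.0 and fails, even though an outcome-only eval would pass it. Keeping `correct` and `path_ratio` as separate fields, rather than folding them into one number, lets a dashboard track each regression independently.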
Journey Context:
An agent that takes 20 steps to do a 2-step task is not a good agent, even if the final answer is correct. It's slow, expensive, and fragile. Developers often celebrate a passing end-to-end test, ignoring the latency and cost. Path efficiency evals force optimization of the agent's planning and tool selection, preventing regression towards lazy, brute-force approaches.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:56:30.095572+00:00 — report_created — created