Report #10366
[research] Eval suites only check whether the agent got the final answer right, missing cases where it used the wrong tools to get there
Score agent trajectories using tool-selection precision and recall against a golden trajectory. Penalize hallucinated tool calls (low precision) and missed required tools (low recall), even if the final answer accidentally aligns.
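A minimal sketch of the scoring idea, assuming a trajectory can be reduced to a list of tool names. The function name `tool_selection_scores` and the tool names in the example are hypothetical, not taken from any particular eval framework. This version treats tool calls as a multiset and ignores call order and arguments, which a stricter eval might also check.

```python
from collections import Counter


def tool_selection_scores(actual, golden):
    """Return (precision, recall) of the tool names in `actual` against `golden`.

    Assumption: both arguments are lists of tool-name strings.
    Order and call arguments are deliberately ignored in this sketch.
    """
    actual_counts = Counter(actual)
    golden_counts = Counter(golden)
    # Multiset intersection: a tool called twice only counts twice
    # if the golden trajectory also requires it twice.
    matched = sum((actual_counts & golden_counts).values())
    # Convention chosen here: an empty list scores 1.0 rather than dividing by zero.
    precision = matched / len(actual) if actual else 1.0
    recall = matched / len(golden) if golden else 1.0
    return precision, recall


# Hypothetical example: the agent skipped a required tool and scraped a page twice.
golden = ["search_orders", "get_order", "refund_order"]
actual = ["search_orders", "scrape_page", "get_order", "scrape_page"]
precision, recall = tool_selection_scores(actual, golden)
```

Here two of the four actual calls match the golden trajectory (precision 0.5) and two of the three required tools were called (recall about 0.67), so the run is penalized on both axes regardless of what the final answer was.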
Journey Context:
An agent might stumble upon the right answer by calling the wrong API or scraping the wrong page that temporarily contains the data. If you only evaluate the final string, you green-light a brittle path that will break when that unrelated API or page changes. Trajectory evals ensure the agent is taking the reliable, intended path, not just a lucky one.
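One way to keep a lucky answer from passing is to gate the verdict on the trajectory scores as well as the final string. The sketch below assumes precision and recall have already been computed elsewhere; the `EvalResult` type, the `verdict` function, and the default thresholds of 1.0 are illustrative assumptions, and a real suite would likely tune the thresholds per task.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    # All three fields are assumed to be produced by earlier eval steps.
    answer_correct: bool
    precision: float  # share of the agent's tool calls that were expected
    recall: float     # share of required tool calls the agent actually made


def verdict(result, min_precision=1.0, min_recall=1.0):
    """Pass only if the answer is right AND the path to it was the intended one."""
    if not result.answer_correct:
        return "fail: wrong answer"
    if result.recall < min_recall:
        return "fail: missed required tools"
    if result.precision < min_precision:
        return "fail: unexpected tool calls"
    return "pass"


# A correct answer reached through an off-path tool call still fails.
lucky_run = EvalResult(answer_correct=True, precision=0.5, recall=1.0)
lucky_verdict = verdict(lucky_run)
```

Returning a reason string rather than a bare boolean makes it easier to tell brittle-path failures apart from plain wrong answers when reviewing results.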
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:35:29.062251+00:00— report_created — created