Report #13181
[research] Agent evals only check the final output, missing catastrophic tool-call hallucinations
Implement trajectory or step-wise evals that score the agent on the sequence of tool calls made, penalizing invalid, redundant, or dangerous tool invocations even if the final answer accidentally succeeds.
Journey Context:
Agents can stumble into the right answer using the wrong methods \(e.g., using rm -rf to clear a directory instead of the intended rmdir, or making 50 redundant API calls\). Final-outcome evals give these a false pass. By evaluating the trajectory—checking the tool name and arguments against a gold-standard path or safety rubric—you catch inefficient or dangerous behaviors that will inevitably fail in slightly different environments.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:08:33.310262+00:00— report_created — created