Report #54121
[research] Agent selects the wrong tool but the LLM-as-a-judge gives it a pass because the final answer was coincidentally correct
Evaluate tool selection independently of the final answer by asserting that the correct tool was invoked with the correct parameters at the correct step in the trace.
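A minimal sketch of such an assertion, assuming the trace is available as an ordered list of tool calls. The `ToolCall` type, the `check_tool_call` helper, and the `query_database` tool name are hypothetical, introduced only for illustration; adapt them to whatever trace format your framework actually records.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One recorded tool invocation in an agent trace (hypothetical shape)."""
    name: str
    params: dict[str, Any] = field(default_factory=dict)


def check_tool_call(trace: list[ToolCall], step: int, expected: ToolCall) -> list[str]:
    """Compare the call at `step` with `expected`; return a list of failures.

    An empty list means the right tool was called with the right parameters
    at the right step. The final answer is deliberately not consulted.
    """
    if not 0 <= step < len(trace):
        return [f"step {step}: no tool call recorded"]
    actual = trace[step]
    if actual.name != expected.name:
        return [f"step {step}: expected tool {expected.name!r}, got {actual.name!r}"]
    if actual.params != expected.params:
        return [f"step {step}: wrong parameters for {expected.name!r}: {actual.params!r}"]
    return []


# Example: the agent answered a count query by calling the wrong tool.
expected = ToolCall("query_database", {"sql": "SELECT COUNT(*) FROM orders"})
bad_trace = [ToolCall("delete_database", {"name": "orders"})]
failures = check_tool_call(bad_trace, 0, expected)
```

Here `failures` is non-empty regardless of what the agent finally answered, so a judge that only reads the answer cannot mask the wrong tool choice.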
Journey Context:
In agentic workflows, the ends do not justify the means. An agent that calls a delete_database tool to answer a simple query may still return the right answer, but the trajectory is catastrophic. Trajectory evaluation decouples the action from the outcome, so the check confirms that the agent followed the intended policy and safety constraints rather than stumbling into a correct answer.
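One way to keep the two signals decoupled is to report them as separate fields and pass a run only when both hold. The sketch below assumes the policy can be expressed as an allowlist of tool names; `Verdict`, `evaluate_run`, and `query_database` are hypothetical names, and `answer_correct` stands in for whatever the LLM-as-a-judge returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Outcome and trajectory results, kept as separate fields."""
    answer_correct: bool
    trajectory_ok: bool

    @property
    def passed(self) -> bool:
        # A correct answer never compensates for a policy violation.
        return self.answer_correct and self.trajectory_ok


def evaluate_run(tools_called: list[str], allowed_tools: set[str], answer_correct: bool) -> Verdict:
    """Score the trajectory against the allowlist, independently of the answer."""
    violations = [name for name in tools_called if name not in allowed_tools]
    return Verdict(answer_correct=answer_correct, trajectory_ok=not violations)


# The judge accepted the answer, but the agent used a tool outside the policy.
verdict = evaluate_run(
    tools_called=["delete_database"],
    allowed_tools={"query_database"},
    answer_correct=True,
)
```

With these inputs `verdict.answer_correct` is true while `verdict.passed` is false, which is the case this report describes.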
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:20:09.100051+00:00— report_created — created