Report #3975
[research] Agent hallucinates tool parameters or selects the wrong tool, but the final eval passes by coincidence
Isolate and evaluate the tool-selection step. Create a dataset that maps each (state, user_request) pair to an expected_tool_call, and score the agent's routing accuracy independently of tool execution.
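A minimal sketch of such a routing eval follows, in Python. Everything named here is an assumption made for illustration, not part of the report: the ToolCall and RoutingCase types, the classify and score_routing helpers, the stub_router standing in for the real agent's tool-selection step, and the three example cases. The tool is never executed; only the proposed call is compared against the expected one.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical representation of a tool call the agent proposes."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class RoutingCase:
    """One dataset row: (state, user_request) -> expected tool call."""
    state: dict[str, Any]
    user_request: str
    expected: ToolCall


def classify(predicted: ToolCall, expected: ToolCall) -> str:
    """Label a routing decision without executing any tool."""
    if predicted.name != expected.name:
        return "wrong_tool"
    # Parameter names the agent supplied that the expected call does not have.
    if set(predicted.args) - set(expected.args):
        return "hallucinated_params"
    if predicted.args != expected.args:
        return "wrong_args"
    return "correct"


def score_routing(
    route: Callable[[dict[str, Any], str], ToolCall],
    cases: list[RoutingCase],
) -> dict[str, Any]:
    """Run the router over every case and tally outcomes by failure mode."""
    counts = Counter(
        classify(route(case.state, case.user_request), case.expected)
        for case in cases
    )
    total = len(cases)
    accuracy = counts["correct"] / total if total else 0.0
    return {"accuracy": accuracy, "counts": dict(counts)}


def stub_router(state: dict[str, Any], user_request: str) -> ToolCall:
    """Stand-in for the agent's tool-selection step (assumption)."""
    if "refund" in user_request:
        # Deliberately adds a parameter the expected call does not define.
        return ToolCall("issue_refund", {"order_id": state["order_id"], "reason": "n/a"})
    return ToolCall("lookup_order", {"order_id": state["order_id"]})


CASES = [
    RoutingCase({"order_id": "A1"}, "where is my order?",
                ToolCall("lookup_order", {"order_id": "A1"})),
    RoutingCase({"order_id": "A2"}, "I want a refund",
                ToolCall("issue_refund", {"order_id": "A2"})),
    RoutingCase({"order_id": "A3"}, "cancel my order",
                ToolCall("cancel_order", {"order_id": "A3"})),
]

report = score_routing(stub_router, CASES)
print(report)
```

With this stub, one case routes correctly, one carries an extra parameter, and one picks the wrong tool, so the tally separates the two failure modes named in the title. In practice the stub would be replaced by a call into the agent that returns its proposed tool call before execution.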
Journey Context:
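End-to-end evals conflate routing errors with tool execution errors. If an agent picks the wrong tool but the tool fails gracefully, the end-to-end eval only sees a failed state and cannot tell why. The reverse also happens: a wrong tool or a hallucinated parameter can still produce a passing final result by coincidence, so the routing error goes unnoticed. Evaluating the routing step separately pinpoints the failure mode immediately.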
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:36:25.519507+00:00— report_created — created