Report #96659
[research] Agent hallucinates answers instead of using provided tools, or uses tools for general knowledge it should know
Log the agent's intent-classification (routing) decision as a distinct span attribute. Then create an eval that checks specifically whether the agent correctly chose RAG/tool use or parametric knowledge, scored independently of the final answer.
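A minimal sketch of the logging half, under several assumptions. `Span` is a hypothetical in-memory stand-in so the example runs without dependencies. With OpenTelemetry you would call `span.set_attribute(key, value)` on a real span instead. The attribute names `agent.intent.route` and `agent.intent.tool` are illustrative, not an established semantic convention. `classify_intent` is a toy keyword router standing in for whatever routing step the agent actually performs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


# Hypothetical stand-in for a tracing span; a real tracer (e.g. OpenTelemetry)
# exposes set_attribute(key, value) on its span objects.
@dataclass
class Span:
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def classify_intent(query: str) -> Dict[str, Any]:
    """Toy router (assumption): order questions need the tool, others don't."""
    if "order" in query.lower():
        return {"route": "tool", "tool": "get_order_status"}
    return {"route": "parametric", "tool": None}


def route_with_span(query: str) -> Span:
    # The routing decision gets its own span, separate from tool execution
    # and answer generation, so it can be evaluated on its own.
    span = Span(name="agent.intent_classification")
    decision = classify_intent(query)
    span.set_attribute("agent.intent.route", decision["route"])
    # Only record the tool name when one was chosen; the attribute is
    # omitted rather than set to None.
    if decision["tool"] is not None:
        span.set_attribute("agent.intent.tool", decision["tool"])
    return span
```

For example, `route_with_span("Where is my order 123?").attributes` holds `{"agent.intent.route": "tool", "agent.intent.tool": "get_order_status"}`, while a general-knowledge question records only `agent.intent.route = "parametric"`. The tool attribute is skipped rather than set to `None` because OpenTelemetry does not accept `None` as an attribute value.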
Journey Context:
Agents often fail not because they can't use a tool, but because they don't know when to use it. If an agent is asked for an order's status and answers from its training data instead of calling the get_order_status tool, the final answer is wrong. A standard end-to-end eval only sees "wrong answer" and cannot tell whether the routing decision or the tool execution failed. Evaluating routing/intent separately from execution isolates the failure mode.
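One way to keep the two evals separate is sketched below. It assumes each trace carries a gold routing label, the observed route read back from the span attribute, and the verdict of the existing answer eval. `TraceRecord`, `attribute_failure`, `summarize`, and the bucket names are all hypothetical. A routing mismatch is counted as a routing failure even when the final answer happens to be correct. Only correctly routed traces can be charged to execution.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TraceRecord:
    expected_route: str   # gold label: "tool" or "parametric" (assumed labeling scheme)
    observed_route: str   # read from the agent.intent.route span attribute
    answer_correct: bool  # verdict from the usual end-to-end answer eval


def attribute_failure(rec: TraceRecord) -> Optional[str]:
    """Assign each trace to at most one failure bucket (None = passed)."""
    if rec.observed_route != rec.expected_route:
        # Routing is judged independently of whether the answer was right.
        if rec.expected_route == "tool":
            return "routing:missed_tool"       # e.g. answered order status from training data
        return "routing:unnecessary_tool"      # e.g. called a tool for general knowledge
    if not rec.answer_correct:
        return "execution"                     # routed correctly, still answered wrong
    return None


def summarize(records: List[TraceRecord]) -> Dict[str, object]:
    if not records:
        raise ValueError("need at least one trace")
    buckets = Counter(attribute_failure(r) for r in records)
    routed_ok = sum(r.observed_route == r.expected_route for r in records)
    return {
        "routing_accuracy": routed_ok / len(records),
        "failures": {k: v for k, v in buckets.items() if k is not None},
    }
```

Splitting `routing:missed_tool` from `routing:unnecessary_tool` mirrors the two halves of this report's title: hallucinating instead of calling a tool, and calling a tool for knowledge the model should already have.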
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:49:43.546204+00:00 - report_created - created