Report #57947
[research] Agent updates can cause the model to select the wrong tool even when the final answer appears correct
Include tool trajectory assertions in regression evals. Compare the sequence of tool calls against a golden trace, either as an exact match or with a regex that matches a subset of the sequence, rather than checking only the final text output.
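A minimal sketch of both assertion styles, assuming the eval harness can export each run as an ordered list of tool names. The helper names (`matches_exact`, `matches_pattern`) and the tool names (`internal_crm_lookup`, `web_search`, `summarize`) are hypothetical and only illustrate the idea.

```python
import re
from typing import Sequence


def matches_exact(actual: Sequence[str], golden: Sequence[str]) -> bool:
    """True only if the run used the same tools in the same order as the golden trace."""
    return list(actual) == list(golden)


def matches_pattern(actual: Sequence[str], pattern: str) -> bool:
    """True if the space-joined tool names fully match the regex.

    Joining with a single space assumes tool names contain no whitespace.
    """
    return re.fullmatch(pattern, " ".join(actual)) is not None


# Hypothetical golden trace for one eval case.
GOLDEN = ["internal_crm_lookup", "summarize"]

# Looser rule: the run must start with the internal lookup and may be followed
# by any other tools except web_search.
STARTS_INTERNAL_NO_WEB = r"internal_crm_lookup( (?!web_search\b)\w+)*"

good_run = ["internal_crm_lookup", "summarize"]
regressed_run = ["web_search", "summarize"]

exact_ok = matches_exact(good_run, GOLDEN)
exact_regressed = matches_exact(regressed_run, GOLDEN)
pattern_ok = matches_pattern(good_run, STARTS_INTERNAL_NO_WEB)
pattern_regressed = matches_pattern(regressed_run, STARTS_INTERNAL_NO_WEB)
```

The exact match is the stricter check and will flag harmless reorderings; the regex form tolerates extra steps while still pinning down which tool must come first and which tool must never appear.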
Journey Context:
Agents can sometimes reach the right answer via the wrong path (e.g., finding internal data via web search). If you evaluate only the final string, you miss that the agent is bypassing internal APIs, which is a serious security and compliance regression.
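The gap can be illustrated with a small sketch in which an answer-only check passes while a trajectory check flags the run. The `AgentRun` record, the tool names, and the sample answer are made up for illustration and do not come from any real harness.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one eval run: final text plus ordered tool names."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)


def answer_only_check(run: AgentRun, expected_substring: str) -> bool:
    """The weaker eval: looks only at the final text."""
    return expected_substring in run.final_answer


def forbidden_tools_used(run: AgentRun, forbidden: set[str]) -> list[str]:
    """Return every forbidden tool the run called, in call order."""
    return [tool for tool in run.tool_calls if tool in forbidden]


# Assumption: for this eval case, internal data must never be fetched via web search.
FORBIDDEN_FOR_INTERNAL_DATA = {"web_search"}

# A run after an agent update: the answer text is right, the path is wrong.
regressed = AgentRun(
    final_answer="The account owner is Dana.",
    tool_calls=["web_search", "summarize"],
)

answer_passes = answer_only_check(regressed, "Dana")
violations = forbidden_tools_used(regressed, FORBIDDEN_FOR_INTERNAL_DATA)
```

Here `answer_passes` is true, so a string-only eval would report success, while `violations` is non-empty and exposes the bypassed internal API.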
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:45:14.725278+00:00 - report_created - created