Report #87242
[research] Agent starts calling the wrong tool after a prompt tweak or model weight update
Log tool_name and tool_args as top-level span attributes in traces. Create a regression eval suite that asserts the exact tool call sequence (or a permitted subset) for a given input, decoupled from the final text output.
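A minimal sketch of the logging half, assuming no particular tracing library: Span, TRACE, record_tool_call and tool_sequence are hypothetical names used only for illustration, and a real tracer would export spans rather than keep them in a list. The point is that tool_name and tool_args sit at the top level of the span's attributes, so an eval can read them without parsing the model's text output.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not a real library type)."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


# Hypothetical in-memory sink; a real setup would export spans to a backend.
TRACE: List[Span] = []


def record_tool_call(tool_name: str, tool_args: Dict[str, Any]) -> Span:
    """Record one tool call with tool_name and tool_args as top-level attributes."""
    span = Span(name="tool_call")
    span.attributes["tool_name"] = tool_name
    # Serialize with sorted keys so equal arguments always compare equal as strings.
    span.attributes["tool_args"] = json.dumps(tool_args, sort_keys=True)
    TRACE.append(span)
    return span


def tool_sequence(trace: List[Span]) -> List[str]:
    """Extract the ordered list of tool names from a trace."""
    return [s.attributes["tool_name"] for s in trace if "tool_name" in s.attributes]
```

Calling record_tool_call from the agent's tool dispatcher, rather than from each tool, keeps the logging in one place so a newly added tool cannot silently skip it.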
Journey Context:
Most evals check only the final answer. But an agent that calls search_web instead of query_database can still land on the right answer by luck, which masks a severe regression in reasoning. Extracting the tool call trace and asserting the path the agent took, not just the destination, catches that regression as soon as the eval runs.
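The path assertion could look roughly like the sketch below. check_tool_path is a hypothetical helper, and "permitted subset" is read here as "every tool the agent called must belong to an allowed set", which is one interpretation rather than the report's stated definition. It takes a plain list of tool names, however those were extracted from the trace, and never looks at the final text output.

```python
from typing import Iterable, List, Optional, Sequence


def check_tool_path(
    actual: Sequence[str],
    expected: Optional[Sequence[str]] = None,
    allowed: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means the path passed.

    expected: if given, the tool calls must match this sequence exactly, in order.
    allowed:  if given, every tool called must be a member of this set
              (an assumed reading of "permitted subset").
    """
    failures: List[str] = []
    calls = list(actual)

    if expected is not None and calls != list(expected):
        failures.append(f"expected sequence {list(expected)}, got {calls}")

    if allowed is not None:
        allowed_set = set(allowed)
        unexpected = [tool for tool in calls if tool not in allowed_set]
        if unexpected:
            failures.append(f"tools outside the permitted set: {unexpected}")

    return failures


# Example regression case: the agent should query the database, not the web.
if __name__ == "__main__":
    problems = check_tool_path(["search_web"], expected=["query_database"])
    for message in problems:
        print("FAIL:", message)
```

Returning messages instead of raising lets a suite collect every path failure across its inputs in one run; wrapping the call in an assert works equally well inside a test framework.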
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:01:33.444893+00:00— report_created — created