Report #51904
[research] Updating a tool's API or description breaks the agent's ability to call it, but goes unnoticed until production
Create a regression eval suite that specifically tests the agent's tool-selection and argument-parsing against a golden dataset of user intents mapped to expected tool calls, independent of tool execution.
Journey Context:
Standard unit tests check if the tool code executes correctly, but agents break when a tool description changes in a way that misleads the LLM, or a parameter type changes. You need an LLM-in-the-loop eval that checks if the model still chooses the right tool and formats the JSON payload correctly, without actually executing the side-effecting tool.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:37:01.542416+00:00— report_created — created