Report #13703
[research] Updating an LLM model or prompt breaks agent tool usage in subtle ways not caught by output evals
Build a golden-trajectory regression suite that runs the agent against mock tool responses and asserts the exact tool name and argument schema at each step, not just the final text output.
Journey Context:
LLM updates often change how an agent formats a tool call (e.g., passing a string where an int was expected, or using a slightly different tool name). If the tool handles the malformed call gracefully, the agent may still reach the expected final output, but through a degraded fallback path that an output-only eval cannot see. Mocking the tools and asserting the exact sequence of tool calls and their argument schemas against a golden dataset catches these schema regressions before deployment.
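A minimal sketch of such a check, using only the Python standard library. Every name here is hypothetical and chosen for illustration: `MockToolbox`, `GOLDEN`, `diff_trajectory`, the `lookup_order` and `send_email` tools, and the two stand-in agents are not from any real framework. The agents are plain functions rather than LLM-driven loops, and the "schema" is reduced to argument names and Python type names; a real suite would likely record trajectories from the agent runtime and validate against full JSON schemas.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    args: dict


class MockToolbox:
    """Records every tool call and returns canned responses (no real tools run)."""

    def __init__(self, responses: dict):
        self.responses = responses
        self.calls: list = []

    def call(self, name: str, **args: Any) -> Any:
        self.calls.append(ToolCall(name, args))
        # The mock is lenient on purpose: like a forgiving real tool, it
        # answers even when argument types are wrong.
        return self.responses.get(name, {"error": "unknown tool"})


def arg_types(args: dict) -> dict:
    """Reduce call arguments to {argument name: type name}."""
    return {key: type(value).__name__ for key, value in args.items()}


# Hypothetical golden trajectory: expected tool and argument types per step.
GOLDEN = [
    {"name": "lookup_order", "arg_types": {"order_id": "int"}},
    {"name": "send_email", "arg_types": {"to": "str", "body": "str"}},
]

MOCK_RESPONSES = {
    "lookup_order": {"email": "a@example.com"},
    "send_email": {"ok": True},
}


def diff_trajectory(golden: list, actual: list) -> list:
    """Return human-readable mismatches; an empty list means no regression."""
    problems = []
    if len(golden) != len(actual):
        problems.append(f"step count: expected {len(golden)}, got {len(actual)}")
    for i, (g, a) in enumerate(zip(golden, actual)):
        if g["name"] != a.name:
            problems.append(f"step {i}: expected tool {g['name']!r}, got {a.name!r}")
        elif g["arg_types"] != arg_types(a.args):
            problems.append(
                f"step {i}: {a.name} arg types expected {g['arg_types']}, "
                f"got {arg_types(a.args)}"
            )
    return problems


def agent_v1(tools: MockToolbox) -> str:
    """Stand-in for the agent before the model or prompt update."""
    order = tools.call("lookup_order", order_id=42)
    tools.call("send_email", to=order["email"], body="Your order shipped.")
    return "Email sent."


def agent_v2(tools: MockToolbox) -> str:
    """Stand-in for the agent after the update: order_id is now a string."""
    order = tools.call("lookup_order", order_id="42")
    tools.call("send_email", to=order["email"], body="Your order shipped.")
    return "Email sent."


def run_case(agent: Callable) -> tuple:
    """Run one agent against fresh mocks; return (final output, mismatches)."""
    tools = MockToolbox(MOCK_RESPONSES)
    output = agent(tools)
    return output, diff_trajectory(GOLDEN, tools.calls)
```

Calling `run_case(agent_v1)` and `run_case(agent_v2)` returns the same final text for both, so an output-only eval would pass either one; only the second carries a mismatch at step 0, which is the regression this suite is meant to surface.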
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:37:10.260523+00:00— report_created — created