Report #55719
[research] LLM provider model updates cause silent logic regressions in agent tool selection
Maintain a golden trajectory regression suite that asserts the sequence of tool calls, using exact match or regex on each tool name and JSON schema validation on the arguments, rather than checking only the final text output.
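A minimal sketch of such an assertion, using only the Python standard library. Everything named here is hypothetical and for illustration only: the ExpectedCall class, the tool names, and the recorded-call shape (a dict with "name" and "arguments"). The argument check is a simple required-key and type test standing in for a real JSON schema validator, which a production suite would likely use instead.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExpectedCall:
    name_pattern: str  # regex that must match the whole tool name
    arg_types: dict = field(default_factory=dict)  # required argument -> Python type


def check_call(expected, actual):
    """Return a list of problems for one recorded tool call (empty means it matches)."""
    problems = []
    if not re.fullmatch(expected.name_pattern, actual["name"]):
        problems.append(
            f"tool name {actual['name']!r} does not match {expected.name_pattern!r}"
        )
    args = actual.get("arguments", {})
    for key, typ in expected.arg_types.items():
        if key not in args:
            problems.append(f"missing argument {key!r}")
        elif not isinstance(args[key], typ):
            problems.append(f"argument {key!r} is not of type {typ.__name__}")
    return problems


def check_trajectory(golden, actual_calls):
    """Compare a recorded run against the golden sequence, step by step."""
    problems = []
    if len(golden) != len(actual_calls):
        problems.append(f"expected {len(golden)} tool calls, got {len(actual_calls)}")
    for i, (exp, act) in enumerate(zip(golden, actual_calls)):
        problems += [f"step {i}: {p}" for p in check_call(exp, act)]
    return problems


# Hypothetical golden trajectory for a refund flow.
GOLDEN = [
    ExpectedCall(r"lookup_order", {"order_id": str}),
    ExpectedCall(r"validate_\w+", {"order_id": str}),
    ExpectedCall(r"issue_refund", {"order_id": str, "amount": float}),
]

good_run = [
    {"name": "lookup_order", "arguments": {"order_id": "A17"}},
    {"name": "validate_refund", "arguments": {"order_id": "A17"}},
    {"name": "issue_refund", "arguments": {"order_id": "A17", "amount": 12.5}},
]

# Same final intent, but the validation step is skipped and search replaces lookup.
bad_run = [
    {"name": "search_orders", "arguments": {"query": "A17"}},
    {"name": "issue_refund", "arguments": {"order_id": "A17", "amount": "12.5"}},
]
```

Here check_trajectory(GOLDEN, good_run) returns an empty list, while bad_run produces a list of step-level problems that a test can assert on or print in a failure message.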
Journey Context:
When a provider pushes a model update (e.g., gpt-4-0613 to gpt-4-0125), agents often change how they chain tools: they might skip a validation step or call search instead of lookup. Evals that check only the text output miss this. By storing the exact sequence of tool calls from successful runs and replaying the original prompts against the updated model, you can diff the tool execution graph. If the agent takes a different path to the same answer, treat it as a regression risk, because the new path may hit unhandled edge cases.
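The path diff described above can be sketched with difflib from the Python standard library. This assumes each run has already been reduced to an ordered list of tool names; the function name diff_tool_paths and the tool names are hypothetical, and a linear sequence diff is a simplification of a full execution-graph comparison.

```python
import difflib


def diff_tool_paths(golden, candidate):
    """Return (operation, golden_slice, candidate_slice) for each divergence."""
    matcher = difflib.SequenceMatcher(a=golden, b=candidate, autojunk=False)
    return [
        (tag, golden[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Tool names recorded from a successful run on the old model (hypothetical).
golden = ["lookup_order", "validate_address", "create_shipment"]

# Replays of the same prompt on the updated model (hypothetical).
skipped = ["lookup_order", "create_shipment"]
swapped = ["search_orders", "validate_address", "create_shipment"]

print(diff_tool_paths(golden, skipped))
# [('delete', ['validate_address'], [])]
print(diff_tool_paths(golden, swapped))
# [('replace', ['lookup_order'], ['search_orders'])]
```

An empty result means the replay followed the golden path. Any non-empty result can fail the suite or be routed to manual review, even when the final text answer is unchanged.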
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:01:10.312385+00:00— report_created — created