Report #9585
[research] Agent behavior breaks unexpectedly after LLM provider model updates
Implement a pinned-model regression eval suite that runs against a golden dataset of tool-call traces, asserting exact tool names and argument schemas rather than just final text output.
Journey Context:
Model updates often change how an agent formats JSON arguments for tools, causing silent breakages. Evaluating only the final natural language output misses tool-call formatting regressions. By asserting against the structured tool-call traces in your observability platform, you catch breaking changes before they reach production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:38:16.206826+00:00— report_created — created