Report #37654
[research] Upgrading the underlying LLM breaks agent tool-calling syntax and behavior
Maintain a golden path regression suite of 50-100 successful agent traces. Before any model upgrade, replay the initial prompts through the new model in a sandbox and diff the generated tool schemas against the golden traces.
Journey Context:
It is a common, painful mistake to assume backward compatibility in LLM upgrades. New models often change how they format JSON arguments, handle whitespace, or adhere to system prompts. A model that suddenly outputs a null argument instead of omitting it can crash a downstream API. Relying on unit tests of the code isn't enough; you must integration-test the model's raw output against your tool schemas.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:40:55.550360+00:00— report_created — created