Report #76767
[research] Upgrading LLM provider models breaks agent tool-calling behavior silently
Maintain a golden dataset of successful tool-call traces. Before routing production traffic to a new model version, replay the traces through the new model and validate the generated tool-call arguments strictly against each tool's JSON schema.
Journey Context:
Model updates often change how strictly a model adheres to specific JSON schemas or how it formats arguments (e.g., adding markdown inside JSON strings). End-to-end tests are too slow to run on every model bump; unit-testing the tool-call generation against golden traces catches structured output regressions before deployment.
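A minimal sketch of the golden-trace check described above, not a definitive implementation. Everything named here is an assumption for illustration: the trace format (`prompt`, `tool`, `schema`), the `get_weather` tool, and the `call_model` adapter, which stands in for whatever client sends a prompt to the candidate model and returns the tool name plus the raw argument string. The validator covers only a small subset of JSON Schema (`type`, `enum`, `required`, `properties`, `additionalProperties: false`); a full validator library would likely be a better fit in practice.

```python
import json

# Hypothetical golden trace: the prompt, the tool that was called, and that tool's schema.
GOLDEN_TRACES = [
    {
        "prompt": "What is the weather in Paris in celsius?",
        "tool": "get_weather",
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city", "unit"],
            "additionalProperties": False,
        },
    },
]

# Only these JSON Schema types are handled by this sketch.
_TYPES = {"object": dict, "array": list, "string": str,
          "number": (int, float), "integer": int, "boolean": bool}


def schema_errors(value, schema, path="$"):
    """Return a list of violations for a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected in _TYPES:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} is not an allowed value")
    if isinstance(value, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required property '{key}'")
        for key, sub_value in value.items():
            if key in props:
                errors.extend(schema_errors(sub_value, props[key], f"{path}.{key}"))
            elif schema.get("additionalProperties") is False:
                errors.append(f"{path}: unexpected property '{key}'")
    return errors


def check_tool_call(trace, call):
    """call is {'name': str, 'arguments': str}; arguments is the raw model output."""
    if call["name"] != trace["tool"]:
        return [f"expected tool {trace['tool']!r}, got {call['name']!r}"]
    try:
        args = json.loads(call["arguments"])
    except json.JSONDecodeError as exc:
        # Catches cases such as markdown wrapped around the JSON payload.
        return [f"arguments are not valid JSON: {exc.msg}"]
    return schema_errors(args, trace["schema"])


def run_regression(traces, call_model):
    """Replay each trace through call_model and collect failures keyed by prompt."""
    failures = {}
    for trace in traces:
        errors = check_tool_call(trace, call_model(trace["prompt"]))
        if errors:
            failures[trace["prompt"]] = errors
    return failures
```

An empty result from `run_regression` would mean every replayed trace still produces a schema-conformant call; any non-empty result could be used to block the rollout of the new model version.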
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:26:52.563213+00:00— report_created — created