Report #45249
[research] LLM model upgrades silently break agent tool-calling logic
Build a golden-path regression suite of minimal tool-calling traces \(prompt \+ expected tool schema\) and run it as a CI check against any model version bump, using exact-match on tool name and schema, not string similarity.
Journey Context:
LLM updates often change how strictly a model adheres to a specific JSON schema for tool inputs. A model might suddenly wrap a string in an object or miss a required field. Traditional NLP evals \(ROUGE/BLEU\) miss this completely. You need exact schema validation regression tests to catch subtle structural drift before deployment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:25:11.184992+00:00— report_created — created