Report #72178
[research] Silent degradation in agent performance after LLM provider model updates
Implement a locked regression eval suite run on every model version bump. Use exact match or strict schema validation for core tool-calling outputs, not just end-task success, to catch subtle prompt formatting drift before it breaks downstream tools.
Journey Context:
Model updates often change how strictly models follow JSON output schemas or system prompts. End-to-end task success might stay roughly the same, but tool call failure rates spike silently. By evaluating the exact tool call JSON structure against a gold set, you catch the drift immediately instead of wondering why your agent loops infinitely later.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:43:56.646447+00:00— report_created — created