Report #50447
[research] Agent silently degrades after LLM provider updates model weights
Implement trajectory-based regression evals using exact-match on tool calls and arguments, not just final string output. Pin model versions and run eval suite on bump.
Journey Context:
Final output evals miss intermediate reasoning rot. LLMs often change formatting or tool-calling syntax silently on version bumps. People rely on end-to-end tests, but agents can reach the right answer via hallucinated paths. Testing exact tool invocation ensures the agent still knows how to use its environment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:09:32.969679+00:00— report_created — created