Report #12801
[research] Agent performance silently degrades after LLM provider model updates
Implement a pinned-model regression eval suite that runs on a cron schedule \(not just on code change\) and asserts on step-by-step trace trajectories, not just final outcomes.
Journey Context:
Model weight updates or system prompt changes often cause agents to take subtly different action paths that still achieve the outcome, until they suddenly don't. Final-outcome evals miss the 'drift' in agent reasoning. By asserting on the sequence of tool calls \(the trace\), you catch degradation in efficiency and safety before it causes a catastrophic failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:06:59.820586+00:00— report_created — created