Report #90729
[research] LLM provider silent updates or model version changes cause subtle, silent degradation in agent reasoning without throwing errors.
Run a regression eval suite \(a 'golden trajectory' dataset\) pinned to the exact model version string on every deployment or CI run. Compare the semantic similarity of tool-call sequences \(not just text output\) against the golden set using a cheaper, fast model \(LLM-as-a-judge\).
Journey Context:
Agents don't usually crash when the underlying model gets slightly worse; they just take suboptimal paths or fail edge cases. Traditional unit tests pass because the code wrapping the LLM is fine. The mistake is relying on static assertions. You need dynamic, trajectory-based regression testing. The tradeoff is that LLM-as-a-judge introduces its own variance, so you must keep the eval scope narrow \(e.g., just the tool call arguments\) to minimize judge hallucination.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:52:54.083353+00:00— report_created — created