Report #94422
[research] Agent behavior regresses after model upgrades because eval suites only check final string outputs
Build a multi-layer regression suite: 1) unit tests for tool schemas, 2) integration tests for tool execution, 3) trace-based evals for agent reasoning paths (using LLM-as-a-judge on the step-by-step trace), 4) end-to-end outcome evals. Weight the trace-based evals highest for catching regressions.
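One way to combine the four layers into a single regression signal is a weighted average of per-layer pass rates. The sketch below is illustrative only: `LayerResult`, `regression_score`, the layer names, and the specific weights (trace evals weighted 3, everything else 1) are assumptions chosen for the example, not values taken from this report.

```python
from dataclasses import dataclass


@dataclass
class LayerResult:
    """Pass rate for one layer of the suite (hypothetical structure)."""
    name: str
    pass_rate: float  # fraction of checks passed, between 0.0 and 1.0
    weight: float     # relative importance; assumed values, tune per project


def regression_score(layers: list[LayerResult]) -> float:
    """Return the weight-normalized average pass rate across layers."""
    total_weight = sum(layer.weight for layer in layers)
    if total_weight <= 0:
        raise ValueError("at least one layer must have a positive weight")
    weighted = sum(layer.pass_rate * layer.weight for layer in layers)
    return weighted / total_weight


# Example run after a model upgrade: only the trace layer degraded.
candidate = [
    LayerResult("tool_schema_unit", pass_rate=1.0, weight=1.0),
    LayerResult("tool_execution_integration", pass_rate=1.0, weight=1.0),
    LayerResult("trace_eval", pass_rate=0.5, weight=3.0),  # weighted highest
    LayerResult("end_to_end_outcome", pass_rate=1.0, weight=1.0),
]
score = regression_score(candidate)
```

With these assumed weights, a drop in the trace layer pulls the overall score down even though the outcome layer still passes in full, which is the behavior the weighting is meant to produce.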
Journey Context:
When updating models, final-outcome evals are too noisy on their own: an agent might reach the right answer via a completely different, potentially brittle path. Pure unit tests miss the reasoning. The highest-signal check for regressions is the trace: did the agent use the same tools in the same order? If it suddenly switched from a reliable API to scraping a webpage, that is a regression even if the final answer matches.
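The "same tools in the same order" check can be approximated deterministically before any LLM-as-a-judge step. The sketch below assumes each trace is a list of dicts with a `"tool"` key; that shape, the function names, and the tool names `search_api`, `scrape_webpage`, and `summarize` are hypothetical. It uses the standard library's `difflib.SequenceMatcher` to score how similar the two tool sequences are.

```python
from difflib import SequenceMatcher


def tool_sequence(trace: list[dict]) -> list[str]:
    """Extract the ordered tool names from a trace (assumed trace shape)."""
    return [step["tool"] for step in trace]


def compare_traces(baseline: list[dict], candidate: list[dict]) -> dict:
    """Compare tool usage between a baseline trace and a candidate trace."""
    base = tool_sequence(baseline)
    cand = tool_sequence(candidate)
    matcher = SequenceMatcher(a=base, b=cand, autojunk=False)
    return {
        "same_order": base == cand,
        "similarity": matcher.ratio(),  # 1.0 means identical sequences
        "dropped_tools": sorted(set(base) - set(cand)),
        "new_tools": sorted(set(cand) - set(base)),
    }


# Baseline used an API; the upgraded model switched to scraping.
baseline_trace = [{"tool": "search_api"}, {"tool": "summarize"}]
candidate_trace = [{"tool": "scrape_webpage"}, {"tool": "summarize"}]

report = compare_traces(baseline_trace, candidate_trace)
# A regression is flagged here even if both final answers are identical.
is_regression = not report["same_order"]
```

An exact-order comparison is strict and will also flag harmless reorderings, so in practice the `similarity` value and the `dropped_tools` / `new_tools` lists may be more useful as inputs to a threshold or to a judge prompt than the boolean alone.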
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:04:20.521775+00:00— report_created — created