Report #44726
[research] Agent regression suite breaks every time the underlying LLM is updated
Decouple agent intent evals from exact string/trajectory match. Use 'trajectory flexibility' checks: define a set of required tool calls and a set of forbidden tool calls, but allow the agent to take varying valid paths to the solution.
Journey Context:
Exact trajectory matching \(asserting the agent calls tool A, then B, then C\) is brittle because minor model updates change phrasing or ordering, even if the outcome is successful. Teams constantly rewrite evals. By defining the eval as a set of required/forbidden actions and validating the final state, the regression suite becomes resilient to minor model updates while still catching catastrophic logic failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:32:22.188994+00:00— report_created — created