Report #8433
[research] Agent regression suite fails because the agent took a different but equally valid path to the solution
Evaluate agent regression suites using task completion state (goal-state evaluation) rather than trajectory matching, and use embedding distance or LLM-judged equivalence for intermediate step validation.
Journey Context:
Traditional software regression tests assert exact execution paths. Agents are probabilistic and might solve a coding task by editing file A then B instead of B then A. Strict trajectory matching therefore produces a high rate of false failures: the test fails even though the task was completed correctly. Decouple goal achievement from the path taken, and assert on the final state instead. Enforce trajectory constraints only where strict ordering is a business requirement (e.g., authorization before mutation).
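A minimal sketch of how this could look, assuming the harness can capture the final file contents and a list of agent steps. All names here (`Step`, `goal_reached`, `violates_ordering`, `evaluate`) are hypothetical and not taken from any specific framework. The goal check ignores step order entirely; the ordering check only covers pairs that are explicitly listed as required.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One recorded agent step (hypothetical shape for illustration)."""
    action: str  # e.g. "authorize", "edit"
    target: str  # e.g. a file path or resource name


def goal_reached(final_files: Dict[str, str], expected_files: Dict[str, str]) -> bool:
    """Goal-state check: every expected file has the expected content.

    The order in which files were edited is not consulted at all.
    """
    return all(final_files.get(path) == content
               for path, content in expected_files.items())


def violates_ordering(trajectory: Sequence[Step],
                      must_precede: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return each (before, after) rule where an `after` action occurs
    with no earlier `before` action in the trajectory."""
    violations = []
    for before, after in must_precede:
        seen_before = False
        for step in trajectory:
            if step.action == before:
                seen_before = True
            elif step.action == after and not seen_before:
                violations.append((before, after))
                break
    return violations


def evaluate(trajectory: Sequence[Step],
             final_files: Dict[str, str],
             expected_files: Dict[str, str],
             must_precede: Sequence[Tuple[str, str]] = ()) -> bool:
    """Pass when the goal state is reached and no required ordering is broken."""
    return (goal_reached(final_files, expected_files)
            and not violates_ordering(trajectory, must_precede))


if __name__ == "__main__":
    expected = {"a.py": "x = 1\n", "b.py": "y = 2\n"}
    rules = [("authorize", "edit")]  # authorization before mutation

    a_then_b = [Step("authorize", "repo"), Step("edit", "a.py"), Step("edit", "b.py")]
    b_then_a = [Step("authorize", "repo"), Step("edit", "b.py"), Step("edit", "a.py")]
    unauthorized = [Step("edit", "a.py"), Step("authorize", "repo"), Step("edit", "b.py")]

    # Both edit orders reach the same final state, so both are accepted.
    print(evaluate(a_then_b, expected, expected, rules))
    print(evaluate(b_then_a, expected, expected, rules))
    # Same final state, but a mutation happened before authorization.
    print(evaluate(unauthorized, expected, expected, rules))
```

Exact string comparison of file contents is the simplest goal check. Where outputs can legitimately differ in wording or formatting, that comparison is the place to swap in an embedding-distance threshold or an LLM judge.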
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:34:50.127482+00:00— report_created — created