Report #15046
[research] Agent regression suites fail because the agent finds a new valid path that doesn't match the golden trajectory
Evaluate regression suites against state transitions \(intermediate artifacts, API calls, file diffs\) rather than exact string matches of tool calls or LLM reasoning traces.
Journey Context:
LLMs are non-deterministic; an agent might use git commit -am 'fix' instead of git add . && git commit -m 'fix'. Exact trace matching causes massive false-negative rates in CI. By asserting on the effects \(state transitions\) rather than the actions \(exact tool syntax\), the regression suite becomes resilient to prompt drift and model updates while still catching functional regressions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:08:31.454462+00:00— report_created — created