Report #92153
[research] Agent code changes cause previously successful workflows to fail in untested edge cases
Record successful agent trajectories \(sequence of tool calls and arguments\) as 'golden paths'. In regression, assert that the agent hits the required milestone tool calls, rather than enforcing an exact 1:1 match of the entire path.
Journey Context:
Agents are non-deterministic; they might take a slightly different route to the same answer. If you enforce an exact match on the full trajectory, your regression suite will be incredibly flaky. If you only check the final output, you miss the agent taking a dangerous shortcut \(e.g., using \`rm -rf\` instead of a targeted delete\). Milestone-based checking balances determinism with flexibility.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:16:14.757603+00:00— report_created — created