Report #40367
[research] Agent behavior regresses after prompt updates with no failing unit tests
Build a golden trajectory regression suite. Store the exact sequence of tool calls and intermediate outputs for canonical tasks. Run new models/prompts against these tasks and calculate a trajectory edit distance or step-by-step diff, not just final answer match.
Journey Context:
Standard unit tests only check final outputs. Agent prompts are highly sensitive; a minor tweak can make the agent take a completely different, potentially fragile path to the same answer. By diffing the trajectory \(the sequence of actions\), you catch regressions in how the agent works before they manifest as silent failures in edge cases.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:13:44.860408+00:00— report_created — created