Report #10542
[research] Agent prompt changes cause unpredictable regressions in multi-step workflows
Build a regression suite of 'trajectory evals' that score the exact sequence of tool calls, not just the final output. For deterministic steps, compare the recorded calls strictly against an expected DAG (directed acyclic graph) of tool calls; for flexible reasoning steps, use an LLM-as-a-judge.
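A minimal sketch of such a check, assuming each run is recorded as an ordered list of tool calls. The names `Step`, `respects_dag`, and `score_trajectory` are hypothetical, and `judge` is a placeholder callback standing in for an LLM-as-a-judge call; it is not any particular library's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    tool: str         # name of the tool the agent called
    detail: str = ""  # free-form arguments or reasoning text


def respects_dag(actual: list[str], nodes: set[str], edges: set[tuple[str, str]]) -> bool:
    """Strict check: exactly the expected steps, each once, ordered consistently with every edge."""
    if len(actual) != len(nodes) or set(actual) != nodes:
        return False
    position = {name: i for i, name in enumerate(actual)}
    return all(position[before] < position[after] for before, after in edges)


def score_trajectory(
    steps: list[Step],
    nodes: set[str],
    edges: set[tuple[str, str]],
    flexible_tools: set[str],
    judge: Callable[[list[Step]], float],
) -> dict:
    """Split a trajectory into deterministic and flexible steps and score each part."""
    # Anything not explicitly marked flexible must match the expected DAG,
    # so an unexpected tool call fails the strict check instead of slipping through.
    deterministic = [s.tool for s in steps if s.tool not in flexible_tools]
    flexible = [s for s in steps if s.tool in flexible_tools]
    return {
        "deterministic_ok": respects_dag(deterministic, nodes, edges),
        "flexible_score": judge(flexible),  # hypothetical LLM judge, injected by the caller
    }
```

Passing the judge in as a callback keeps the strict part of the suite deterministic and lets the judged part be stubbed out in unit tests.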
Journey Context:
Final-output evals are insufficient for agents because an agent might reach the right answer via a disastrous path (e.g., deleting and recreating a database instead of updating it). Prompt tweaks often alter the path. By evaluating the trajectory (the sequence of actions), you catch regressions in the agent's decision-making process before they cause real-world side effects.
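To illustrate the failure mode, the sketch below compares a baseline run against a run produced after a prompt change. The run format (a dict with `final_output` and `trajectory` keys), the tool names, and the `compare_runs` helper are all assumptions made for this example.

```python
def compare_runs(baseline: dict, candidate: dict, risky_tools: frozenset[str]) -> dict:
    """Compare two recorded runs on both final output and trajectory."""
    base_traj = baseline["trajectory"]
    cand_traj = candidate["trajectory"]
    return {
        "output_match": baseline["final_output"] == candidate["final_output"],
        "trajectory_match": base_traj == cand_traj,
        # Risky tools the candidate called that the baseline never did.
        "new_risky_calls": sorted((set(cand_traj) - set(base_traj)) & risky_tools),
    }


# Hypothetical recorded runs: same answer, very different path.
baseline_run = {"final_output": "email updated", "trajectory": ["update_database"]}
candidate_run = {
    "final_output": "email updated",
    "trajectory": ["drop_database", "create_database"],
}

report = compare_runs(baseline_run, candidate_run, frozenset({"drop_database"}))
```

Here a final-output eval would pass, since `output_match` is true, while the trajectory comparison flags both the changed path and the newly introduced `drop_database` call.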
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:05:06.186021+00:00— report_created — created