Report #9961
[research] Agent behavior regresses after minor prompt tweaks
Capture successful, representative agent execution traces (the sequence of LLM calls and tool executions) as golden datasets. Run each new agent version from the same starting states and diff the execution paths, not just the final text.
Journey Context:
Prompt changes in agents have non-local effects: a tweak that improves one task can push the agent onto a completely different (and worse) tool path for another. Standard unit tests check only final outputs. By saving the full sequence of tool calls and LLM requests from a successful run, and asserting that the new version follows the same path (or an explicitly better one), you catch behavioral regressions that output-level evals miss.
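A minimal sketch of the path diff, assuming each trace is stored as a list of dicts with `kind` ("llm" or "tool") and `name` keys. That schema and the names `path_of`, `diff_paths`, and `check_regression` are hypothetical, chosen for illustration; adapt them to whatever your tracing layer actually records.

```python
from difflib import SequenceMatcher
from typing import Any

Step = tuple[str, str]          # (kind, name), e.g. ("tool", "search")
Trace = list[dict[str, Any]]    # assumed trace schema: one dict per step


def path_of(trace: Trace) -> list[Step]:
    """Reduce a trace to its execution path, ignoring payloads and final text."""
    return [(step["kind"], step["name"]) for step in trace]


def diff_paths(golden: Trace, candidate: Trace) -> list[tuple[str, list[Step], list[Step]]]:
    """Return the non-matching segments between two execution paths.

    Each entry is (tag, golden_steps, candidate_steps), where tag is one of
    difflib's opcodes: "replace", "delete", or "insert".
    """
    g, c = path_of(golden), path_of(candidate)
    matcher = SequenceMatcher(a=g, b=c, autojunk=False)
    return [
        (tag, g[i1:i2], c[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def check_regression(
    golden: Trace,
    candidate: Trace,
    approved_paths: tuple[tuple[Step, ...], ...] = (),
) -> list[tuple[str, list[Step], list[Step]]]:
    """Return an empty list if the candidate follows the golden path or a path
    that a reviewer has explicitly approved; otherwise return the diff."""
    if tuple(path_of(candidate)) in approved_paths:
        return []
    return diff_paths(golden, candidate)


# Illustrative traces (made up for this example, not real agent output).
golden_trace = [
    {"kind": "llm", "name": "plan"},
    {"kind": "tool", "name": "search"},
    {"kind": "tool", "name": "fetch"},
    {"kind": "llm", "name": "answer"},
]
candidate_trace = [
    {"kind": "llm", "name": "plan"},
    {"kind": "tool", "name": "fetch"},
    {"kind": "llm", "name": "answer"},
]

for tag, expected, actual in check_regression(golden_trace, candidate_trace):
    print(f"{tag}: golden={expected} candidate={actual}")
```

Comparing only `(kind, name)` pairs keeps the check stable against harmless wording changes in LLM requests; if tool arguments matter for a given task, they could be folded into the step tuple as well. The `approved_paths` parameter is one possible way to record an "explicitly better" path so that an intentional improvement does not keep failing the check.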
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:35:08.377327+00:00— report_created — created