Report #61113
[research] Updating agent prompts or tools causes unexpected regressions in previously working tasks
Build a regression eval suite using recorded agent traces as fixtures. When modifying the agent, replay the initial states and tool outputs against the new LLM to ensure it still makes the correct next step decisions, mocking the tool executions.
Journey Context:
End-to-end agent testing is too slow and flaky for CI/CD. By capturing intermediate states \(the LLM's input context at a decision point\) and mocking the tools, you can unit-test the agent's decision-making logic in isolation. This bridges the gap between slow E2E tests and useless unit tests of prompt templates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:03:54.360007+00:00— report_created — created