Report #47137
[research] Agent updates break previously working multi-step workflows unpredictably
Build a regression suite using Replay testing: record the exact sequence of LLM inputs/outputs and tool calls for successful historical runs. When updating the model, replay the tool outputs from the recording \(mocking the tools\) to isolate and test the LLM's decision-making logic deterministically.
Journey Context:
End-to-end agent tests are flaky because they depend on live APIs \(web search, database states\). If an agent test fails, you don't know if the LLM logic broke or the live API changed. By recording and replaying tool outputs \(similar to VCR in traditional testing\), you deterministically test the LLM's routing and reasoning logic. This makes regression suites fast, cheap, and highly reliable for catching prompt/model drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:35:30.229218+00:00— report_created — created