Report #42385
[research] Refactoring agent prompts or tools causes regressions in previously solved edge cases, but running live end-to-end evals is too slow and expensive
Build a regression suite from recorded trajectory replays. Mock the tool outputs using traces recorded from successful runs, and evaluate only the LLM's routing and generation logic against the mocked environment.
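A minimal sketch of the replay side, assuming a trace is stored as an ordered list of tool calls. The trace format, the tool names (`search_orders`, `refund_order`), and the helpers `ReplayEnvironment`, `run_agent`, and `scripted_decide` are all hypothetical and not part of the report. `scripted_decide` stands in for the real model call so the example runs offline; in a real suite it would wrap the LLM request.

```python
from typing import Any, Callable, Dict, List

# Hypothetical fixture: tool calls captured from one successful run.
RECORDED_TRACE: List[Dict[str, Any]] = [
    {"tool": "search_orders", "args": {"customer_id": "c-17"},
     "output": {"order_ids": ["o-1"]}},
    {"tool": "refund_order", "args": {"order_id": "o-1"},
     "output": {"status": "refunded"}},
]


class ReplayMismatch(AssertionError):
    """The agent deviated from the recorded trajectory."""


class ReplayEnvironment:
    """Serves recorded tool outputs in order; never touches a real tool."""

    def __init__(self, trace: List[Dict[str, Any]]) -> None:
        self._trace = list(trace)
        self._cursor = 0

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        if self._cursor >= len(self._trace):
            raise ReplayMismatch(f"unexpected extra call: {tool}({args})")
        step = self._trace[self._cursor]
        if tool != step["tool"] or args != step["args"]:
            raise ReplayMismatch(
                f"step {self._cursor}: expected {step['tool']}({step['args']}), "
                f"got {tool}({args})"
            )
        self._cursor += 1
        return step["output"]

    def assert_exhausted(self) -> None:
        if self._cursor != len(self._trace):
            raise ReplayMismatch(
                f"agent stopped after {self._cursor} of {len(self._trace)} calls"
            )


Decide = Callable[[List[Dict[str, Any]]], Dict[str, Any]]


def run_agent(decide: Decide, env: ReplayEnvironment, max_steps: int = 10) -> str:
    """Drive the agent loop; `decide` returns a tool call or a final answer."""
    history: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        action = decide(history)
        if "final" in action:
            env.assert_exhausted()  # a premature answer also counts as a regression
            return action["final"]
        output = env.call(action["tool"], action["args"])
        history.append({"action": action, "output": output})
    raise ReplayMismatch("agent did not finish within max_steps")


def scripted_decide(history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for the model call so the sketch runs offline."""
    if len(history) == 0:
        return {"tool": "search_orders", "args": {"customer_id": "c-17"}}
    if len(history) == 1:
        order_id = history[0]["output"]["order_ids"][0]
        return {"tool": "refund_order", "args": {"order_id": order_id}}
    return {"final": "Refund issued for o-1."}


def test_refund_trajectory() -> None:
    answer = run_agent(scripted_decide, ReplayEnvironment(RECORDED_TRACE))
    assert "o-1" in answer
```

Strict step-by-step matching is one possible design choice. It flags any change in tool routing, including harmless reorderings, so a looser comparison (for example, matching only tool names) may suit agents whose argument formatting legitimately varies.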
Journey Context:
Live end-to-end tests are flaky and expensive (API costs, latency). If you save the exact tool inputs and outputs from a successful run, you can replay them as a mocked environment. This removes the environment as a source of non-determinism and turns the live test into a repeatable unit test of the LLM's decision-making, so prompt regressions surface in CI without calling live tools.
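Capturing the tool inputs and outputs could look roughly like the sketch below. `TraceRecorder`, the `get_weather` tool, and the JSON layout are assumptions made for illustration; the report does not specify a recording format. The sketch also assumes tools take keyword arguments and return JSON-serializable values.

```python
import json
from typing import Any, Callable, Dict, List


class TraceRecorder:
    """Wraps tool functions and logs every call's inputs and output."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def wrap(self, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        def recorded(**kwargs: Any) -> Any:
            output = fn(**kwargs)  # the real tool runs during recording
            self.calls.append({"tool": name, "args": kwargs, "output": output})
            return output
        return recorded

    def dumps(self) -> str:
        # sort_keys keeps the fixture stable across runs, which keeps diffs small
        return json.dumps(self.calls, sort_keys=True, indent=2)


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical tool; a real one would call an external service."""
    return {"city": city, "temp_c": 21}


recorder = TraceRecorder()
tools = {"get_weather": recorder.wrap("get_weather", get_weather)}

# During a live run the agent calls the wrapped tool as usual.
tools["get_weather"](city="Oslo")

# After a run that is judged successful, persist the fixture for replay in CI.
fixture = recorder.dumps()
```

Only traces from runs that were checked and accepted should become fixtures, since a replay test can only be as correct as the run it was recorded from.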
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:36:49.557246+00:00— report_created — created