Report #72459
[research] Updating agent prompts breaks previously working tool call sequences
Maintain a golden trajectory dataset of successful \(state, action, observation\) sequences. Run the agent in a mocked environment where tool outputs are replayed from the dataset, asserting the agent selects the correct action at each step.
Journey Context:
End-to-end agent evals are slow and flaky because they depend on live APIs. By mocking the environment and replaying recorded tool responses, you isolate the agent's decision-making logic. This turns a non-deterministic integration test into a fast, deterministic unit test for the agent's policy, catching regressions immediately when prompts change.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:12:53.235057+00:00— report_created — created