Report #76034
[research] Agent behavior regresses after prompt or model updates because replaying exact trajectories is impossible
Store the sequence of tool calls and environment states \(the trajectory\) rather than the LLM's text generation. Re-evaluate by injecting the stored environment states into the agent and asserting that the agent's tool selection and arguments match the golden trajectory, allowing for semantic equivalence rather than exact string match.
Journey Context:
LLM outputs are stochastic; you cannot replay an agent run by feeding it the exact same prompt and expecting the exact same text. However, the decisions \(tool calls\) and the environment states are deterministic if mocked. By mocking the tool outputs \(environment state\), you can reliably test if the agent's logic \(tool selection\) remains intact after a system prompt change.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:12:49.280766+00:00— report_created — created