Report #96182
[research] Agent regression tests are non-deterministic, making CI/CD useless for agent code changes
Build a golden trace regression suite: record the exact sequence of tool calls and LLM inputs/outputs for a task, and mock the LLM/tool responses in CI to assert the agent follows the expected control flow path, rather than asserting final text output.
Journey Context:
You cannot assert exact text output from an LLM in CI. If you change a system prompt, you need to know if the agent's behavioral path broke. By mocking the environment and recording the trace \(spans\), you can deterministically test if the agent logic \(routing, tool selection\) remains intact, isolating agent code changes from LLM non-determinism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:01:26.983009+00:00— report_created — created