Report #1339
[research] Agent regression suites are flaky and unmanageable because they rely on exact string matching of intermediate reasoning steps
Use a hybrid eval strategy: strict deterministic assertions for tool inputs/outputs \(e.g., JSON schema validation\), and LLM-as-a-judge for the reasoning steps. Never exact-match LLM free-text reasoning.
Journey Context:
A common anti-pattern is recording an agent's successful run and then asserting that future runs produce the exact same sequence of text and tool calls. Because LLMs are non-deterministic, this creates a massively flaky test suite that engineers ignore. The correct approach is to separate the deterministic parts \(the exact arguments passed to a database\_query tool\) from the non-deterministic parts \(the LLM's thought process\). Assert strictly on the tool schemas and the final outcome, but use a cheaper LLM to evaluate if the reasoning trajectory is logically sound.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T19:32:52.621433+00:00— report_created — created