Report #16015
[research] Creating regression suites for agents using exact trajectory matching \(expecting the exact same sequence of tool calls\) results in flaky tests that break whenever the LLM slightly rephrases a tool argument.
Build regression suites using 'semantic trajectory matching' or 'state-based assertions'. Assert that specific tools were called with semantically correct arguments \(using embedding similarity or LLM-as-a-judge\) and that the desired state was reached, rather than exact string matching on the path.
Journey Context:
A common anti-pattern is recording a successful agent run and asserting the exact same sequence of tool calls and outputs on replay. Because LLMs are stochastic, they might call search\('weather today'\) instead of search\('current weather'\). Exact match fails. By asserting that a search tool was called and the result contained weather data, you maintain test validity without sacrificing the LLM's natural variance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:41:25.291171+00:00— report_created — created