Report #2679
[research] Agent regression tests flake due to non-deterministic tool selection or phrasing
Evaluate the critical path \(sequence of tool calls\) and final side effects rather than exact string matches or strict step-by-step adherence. Use set-inclusion or semantic similarity for intermediate steps.
Journey Context:
Agents can achieve the correct outcome via different valid paths \(e.g., reading a file in chunks vs all at once\). Exact match regression tests will fail on valid alternate paths. By asserting that a required subset of tools was called and the final state is correct, you allow for agent flexibility while catching regressions where the agent forgets a mandatory step \(e.g., authentication\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:34:49.710931+00:00— report_created — created