Report #73749
[research] Agent regression suite fails intermittently due to non-deterministic LLM paths
Evaluate agent trajectories using milestone or key-action matching rather than exact step-by-step path matching. Assert that required tool calls were invoked in a valid partial order, ignoring intermediate reasoning steps.
Journey Context:
Agents can solve the same problem via different reasoning paths. Exact-match evals (did it call tool A, then B, then C?) fail whenever the LLM happens to call B before A, even though the run was otherwise correct. Developers then learn to ignore failing evals. The fix is partial-order matching of critical milestones (e.g., file was read -> edit was applied -> tests were run), which leaves the agent free to vary its reasoning while guaranteeing that the critical safety and functional steps were hit.
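A minimal sketch of the milestone check described above, not a definitive implementation. It assumes the trajectory has already been reduced to a flat list of action names; the function name `check_milestones` and the action names (`read_file`, `apply_edit`, `run_tests`) are hypothetical and chosen only for illustration. An ordering constraint `(a, b)` is treated as satisfied when some occurrence of `a` precedes some occurrence of `b`; stricter semantics may be needed for safety-critical steps.

```python
from typing import Iterable, List, Sequence, Tuple


def check_milestones(
    trajectory: Sequence[str],
    required: Iterable[str],
    before: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return a list of violations; an empty list means the run passes.

    trajectory: action names in the order the agent performed them.
    required:   milestones that must appear at least once.
    before:     (earlier, later) pairs forming the partial order.
    """
    first = {}  # index of the first occurrence of each action
    last = {}   # index of the last occurrence of each action
    for i, action in enumerate(trajectory):
        first.setdefault(action, i)
        last[action] = i

    violations = []
    for milestone in required:
        if milestone not in first:
            violations.append(f"missing milestone: {milestone}")

    for earlier, later in before:
        # A missing milestone is already reported above; skip its ordering.
        if earlier not in first or later not in first:
            continue
        # Satisfied if some `earlier` happens before some `later`.
        if not first[earlier] < last[later]:
            violations.append(f"{earlier} must precede {later}")
    return violations


# Hypothetical milestones for a code-editing agent.
REQUIRED = ["read_file", "apply_edit", "run_tests"]
ORDER = [("read_file", "apply_edit"), ("apply_edit", "run_tests")]

# Intermediate steps ("think", "list_dir") are ignored by the check.
ok_run = ["think", "list_dir", "read_file", "think", "apply_edit", "run_tests"]
bad_order = ["apply_edit", "read_file", "run_tests"]
missing_tests = ["read_file", "apply_edit"]

print(check_milestones(ok_run, REQUIRED, ORDER))
print(check_milestones(bad_order, REQUIRED, ORDER))
print(check_milestones(missing_tests, REQUIRED, ORDER))
```

Milestones with no constraint between them may appear in either order, which is what makes the check tolerant of the "B then A" case. Returning a list of violations rather than a boolean keeps failure messages actionable when the suite does fail.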
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:23:04.814376+00:00— report_created — created