Report #85528
[research] Agent takes a valid but completely different path to the solution, breaking rigid golden trajectory evals
Evaluate state transitions rather than exact action sequences. Define a DAG of valid states (e.g., dependencies installed -> file modified -> tests run) instead of a linear list of exact tool calls. The agent passes as long as it reaches every required state node, each after its prerequisites.
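A minimal sketch of such a check, assuming the harness can already reduce a run to an ordered list of reached state names. The state names, REQUIRED_STATES, and evaluate_states are hypothetical and chosen only for illustration; they are not part of any specific eval framework.

```python
from typing import Dict, List, Set

# Hypothetical state DAG: each required state maps to the states
# that must already have been reached before it counts.
REQUIRED_STATES: Dict[str, Set[str]] = {
    "dependencies_installed": set(),
    "file_modified": {"dependencies_installed"},
    "tests_run": {"file_modified"},
}


def evaluate_states(observed: List[str],
                    dag: Dict[str, Set[str]] = REQUIRED_STATES) -> bool:
    """Return True if every required state is reached after its prerequisites."""
    reached: Set[str] = set()
    for state in observed:
        if state not in dag:
            continue  # extra states the agent passed through are ignored
        if dag[state] <= reached:  # all prerequisites already reached
            reached.add(state)
    return reached == set(dag)
```

With these assumptions, a run that visits extra states still passes, a run that never reaches tests_run fails, and a tests_run that happens before file_modified does not count until it is repeated after the modification.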
Journey Context:
Agents, especially highly capable ones, will find novel ways to solve problems. If your eval asserts that the agent must call ls, then cat, then sed, a smarter agent that uses awk in one step will fail the eval despite reaching the same or a better outcome. Linear trajectory evals therefore produce false failures that punish model improvements. State-based DAG evals allow flexible, creative problem-solving while ensuring critical steps (like running tests) aren't skipped.
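The failure mode can be illustrated with a toy comparison. This is a sketch under simplifying assumptions: each step is recorded as a hypothetical (tool name, file content after the step) pair, and the only required state is that the file ends up with the expected content. The names Step, GOLDEN_TOOLS, linear_eval, and state_eval are invented for the example.

```python
from typing import List, Tuple

# Hypothetical step record: (tool name, file content after the step).
Step = Tuple[str, str]

GOLDEN_TOOLS = ["ls", "cat", "sed"]  # rigid golden trajectory


def linear_eval(trajectory: List[Step]) -> bool:
    """Pass only if the tool calls match the golden list exactly."""
    return [tool for tool, _ in trajectory] == GOLDEN_TOOLS


def state_eval(trajectory: List[Step], expected_content: str) -> bool:
    """Pass if the file ends in the expected state, whatever tools were used."""
    return bool(trajectory) and trajectory[-1][1] == expected_content


old, new = "debug = true\n", "debug = false\n"
golden_run: List[Step] = [("ls", old), ("cat", old), ("sed", new)]
awk_run: List[Step] = [("awk", new)]  # one-step solution, same end state

print(linear_eval(golden_run), linear_eval(awk_run))          # True False
print(state_eval(golden_run, new), state_eval(awk_run, new))  # True True
```

The awk run is rejected by the exact-sequence check even though it leaves the file in the same state, while the state check accepts both runs.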
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:08:53.450053+00:00— report_created — created