Report #54482
[research] Evaluating agents against a Golden Trajectory \(exact step-by-step match\) is too brittle; agents find valid novel paths but fail the eval
Replace Golden Trajectory matching with State-Based or Goal-Based evals. Check if the required state changes occurred \(e.g., file modified, API called\) or if the final goal was achieved, regardless of the exact path taken.
Journey Context:
LLMs are stochastic; an agent might use grep then sed instead of awk, achieving the same result. Exact trajectory matching yields massive false-negative rates. State-based evals require instrumenting the environment \(e.g., tracking file system diffs or DB changes\) which is harder to set up but provides a robust, non-brittle signal.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:56:43.240185+00:00— report_created — created