Report #64123
[research] Agent reaches correct answer but uses flawed or dangerous reasoning steps
Decouple trajectory evals from outcome evals. Use an automated 'trajectory eval' that compares the agent's sequence of actions (the trace) against an ideal trajectory, penalizing inefficient loops or unauthorized tool usage, even if the final answer is correct.
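A minimal sketch of such a trajectory eval, assuming the trace has already been reduced to a list of tool names. The function name `trajectory_score`, the penalty weights, and the tool names are hypothetical and chosen only for illustration. Similarity to the ideal trajectory is approximated here with the standard library's `difflib.SequenceMatcher`, which is one possible choice among many.

```python
from collections import Counter
from difflib import SequenceMatcher


def trajectory_score(trace, ideal, allowed_tools,
                     loop_penalty=0.1, unauthorized_penalty=0.5):
    """Score a trace against an ideal trajectory, independent of the outcome.

    trace, ideal: lists of tool names (hypothetical representation).
    allowed_tools: set of tool names the agent is authorized to call.
    Returns (score clamped to be non-negative, list of unauthorized calls).
    """
    # Sequence similarity in [0, 1]; 1.0 means the trace equals the ideal.
    similarity = SequenceMatcher(None, trace, ideal, autojunk=False).ratio()

    # Count calls that repeat a tool more often than the ideal trajectory does.
    trace_counts = Counter(trace)
    ideal_counts = Counter(ideal)
    extra_repeats = sum(
        max(0, n - max(1, ideal_counts[tool]))
        for tool, n in trace_counts.items()
    )

    # Any call outside the allow-list is penalized heavily.
    unauthorized = [tool for tool in trace if tool not in allowed_tools]

    score = (similarity
             - loop_penalty * extra_repeats
             - unauthorized_penalty * len(unauthorized))
    return max(0.0, score), unauthorized


ideal = ["search", "read", "answer"]
allowed = {"search", "read", "answer"}

clean_score, _ = trajectory_score(ideal, ideal, allowed)
loop_score, _ = trajectory_score(
    ["search", "search", "search", "read", "answer"], ideal, allowed)
unsafe_score, flagged = trajectory_score(
    ["search", "shell_exec", "answer"], ideal, allowed)
```

In this sketch the looping trace and the trace with an unauthorized call both score below the clean trace, regardless of whether their final answers were correct. The score would be reported alongside the outcome eval rather than merged into it.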
Journey Context:
Outcome-based evals measure what ultimately matters, but they suffer from the 'lucky guess' problem, especially with simpler models: a correct final answer says nothing about how it was reached. An agent might bypass safety checks or use brute-force approaches that are neither scalable nor safe. Trajectory evals enforce process adherence. The tradeoff is that strict trajectory matching is brittle, so use an LLM-as-a-judge on the trajectory or a weighted graph matching algorithm to allow valid alternative paths.
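One simple way to read the graph-based option is sketched below. Allowed transitions between steps are encoded as a weighted graph, any path through the graph counts as valid, and a trace is scored by how its cost compares with the cheapest valid path. This is a transition-graph check and not a full graph matching algorithm. The step names, edge weights, and function names are all hypothetical.

```python
import heapq

# Hypothetical transition graph: EDGES[a][b] is the cost of going from a to b.
EDGES = {
    "start": {"search": 1.0, "lookup_cache": 0.5},
    "search": {"read": 1.0},
    "lookup_cache": {"read": 1.0, "answer": 1.0},
    "read": {"answer": 1.0},
}


def path_cost(trace, edges=EDGES):
    """Total cost of a trace, or None if any transition is not in the graph."""
    cost, prev = 0.0, "start"
    for step in trace:
        options = edges.get(prev, {})
        if step not in options:
            return None
        cost += options[step]
        prev = step
    return cost


def cheapest_cost(edges=EDGES, source="start", target="answer"):
    """Dijkstra's algorithm: cost of the cheapest valid path to the target."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in edges.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return None


def graph_score(trace, edges=EDGES):
    """1.0 for the cheapest valid path, lower for costlier ones, 0.0 if invalid."""
    cost = path_cost(trace, edges)
    if cost is None or not trace or trace[-1] != "answer":
        return 0.0
    return cheapest_cost(edges) / cost
```

With these assumed weights, both `["lookup_cache", "answer"]` and `["search", "read", "answer"]` are accepted as valid paths, while `["search", "answer"]` is rejected because it skips a required step. Tuning the edge weights controls how strongly costlier alternatives are penalized.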
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:06:54.944414+00:00— report_created — created