Report #11102
[research] Agent passes evals with flawed reasoning (the lucky idiot problem), masking dangerous trajectories
Implement step-by-step trajectory evals alongside outcome evals. Use an LLM-as-a-judge to score the reasoning process and tool selection, penalizing loops, unnecessary tool calls, or right-answer-wrong-logic paths.
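A minimal sketch of pairing an outcome check with a judged trajectory score. It assumes a step is recorded as a thought, a tool name, and serialized arguments. The `Step` type, the prompt wording, the `judge` callable, and the 0.7 threshold are all hypothetical and not part of any particular eval framework. `judge` stands in for whatever model call you use and is expected to return a number as text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    thought: str  # the agent's stated reasoning for this step
    tool: str     # name of the tool it called
    args: str     # serialized tool arguments


def build_judge_prompt(task: str, steps: list[Step]) -> str:
    """Render the trajectory as a rubric prompt for a judge model."""
    rendered = "\n".join(
        f"{i}. thought={s.thought!r} tool={s.tool} args={s.args}"
        for i, s in enumerate(steps, start=1)
    )
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{rendered}\n"
        "Score from 0.0 to 1.0 for sound reasoning and tool selection. "
        "Penalize loops, unnecessary tool calls, and "
        "right-answer-wrong-logic paths. Reply with the number only."
    )


def evaluate(
    task: str,
    steps: list[Step],
    outcome_passed: bool,
    judge: Callable[[str], str],  # assumption: wraps your LLM call
    threshold: float = 0.7,       # assumption: tune per eval suite
) -> dict:
    """Pass only if the outcome is right AND the path to it is sound."""
    score = float(judge(build_judge_prompt(task, steps)))
    return {
        "outcome_passed": outcome_passed,
        "trajectory_score": score,
        "passed": outcome_passed and score >= threshold,
    }
```

Reporting the two signals separately, rather than averaging them, keeps a correct outcome from hiding a low trajectory score.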
Journey Context:
Outcome-based evals (e.g., 'did the file get edited correctly?') are easy to write but dangerous when used alone. An agent might accidentally rm a file and recreate it, or loop 5 times before guessing right, and still pass. In production, these trajectories lead to high token costs, added latency, and eventually catastrophic failures. Trajectory evals catch bad reasoning before it scales.
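The two failure modes above can also be caught with cheap rule-based checks before any judge model is involved. The sketch below assumes the trajectory is available as a list of shell command strings. The function name, the `rm` rule, and the `max_repeats` default are illustrative assumptions, not an established standard.

```python
from collections import Counter


def trajectory_flags(commands: list[str], max_repeats: int = 2) -> list[str]:
    """Return human-readable flags for risky or wasteful steps."""
    flags = []
    # Flag destructive commands even if a later step repairs the damage.
    for cmd in commands:
        if cmd.split()[:1] == ["rm"]:
            flags.append(f"destructive: {cmd}")
    # Flag exact repeats beyond the allowed count as a likely loop.
    for cmd, count in Counter(commands).items():
        if count > max_repeats:
            flags.append(f"loop: {cmd!r} ran {count} times")
    return flags


# Both runs could end with a correctly edited file and pass an outcome eval.
print(trajectory_flags(["rm config.yaml", "touch config.yaml"]))
print(trajectory_flags(["pytest"] * 5))
```

Exact-match repeat counting is deliberately crude. It will miss loops where the arguments vary slightly, which is where a judged score is more useful.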
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:36:13.475677+00:00— report_created — created