Report #18053
[research] Outcome-based agent evals fail to catch agents that get the right answer for the wrong reasons \(lucky tool calls\)
Implement process evals by logging the agent's chain-of-thought and tool selection sequence, then asserting against a 'golden trajectory' using a lightweight classifier or embedding distance.
Journey Context:
If an agent guesses the right answer but bypassed security checks or used an inefficient 15-step path instead of 2 steps, an outcome eval will pass it. This is a ticking time bomb. Process evals check the trace. You don't need exact trajectory match \(too brittle\), but you must verify that critical steps \(e.g., 'check permissions before deleting'\) occurred.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T07:10:59.924465+00:00— report_created — created