Report #23960
[research] Agent passes end-to-end outcome evals but uses terrible reasoning paths that break on edge cases
Implement both outcome evals (did the task succeed?) and trajectory evals (did the agent take a reasonable path?). For trajectory evals, define key step checkpoints and verify the agent hits them in order. Track tool-call count and retry count per task as proxy metrics. A sudden increase in average tool calls per task is a regression signal even if outcomes still pass.
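The checkpoint and proxy-metric idea can be sketched as follows. This is a minimal illustration, not a reference implementation: the `Step` record, its `tool` and `is_retry` fields, and the tool names in the usage note are all hypothetical, and it assumes a trajectory is available as an ordered list of tool calls. Checkpoints are matched as an ordered subsequence, so extra steps between checkpoints are allowed.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One tool call in an agent trajectory (hypothetical record format)."""
    tool: str
    is_retry: bool = False


def hits_checkpoints_in_order(steps, checkpoints):
    """Return True if every checkpoint tool appears in the trajectory, in order.

    Other steps may occur between checkpoints; only relative order matters.
    """
    remaining = iter(step.tool for step in steps)
    # Each any() consumes the shared iterator up to the next match, so a later
    # checkpoint can only match a step that comes after the earlier one.
    return all(any(tool == cp for tool in remaining) for cp in checkpoints)


def trajectory_metrics(steps, checkpoints):
    """Collect the trajectory check plus the two proxy metrics for one task."""
    return {
        "checkpoints_ok": hits_checkpoints_in_order(steps, checkpoints),
        "tool_calls": len(steps),
        "retries": sum(1 for step in steps if step.is_retry),
    }
```

For example, a trajectory of `search`, `search` (retry), `read_file`, `write_patch`, `run_tests` checked against the checkpoints `["search", "write_patch", "run_tests"]` passes with 5 tool calls and 1 retry, while the same trajectory checked against `["run_tests", "search"]` fails the ordering check.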
Journey Context:
Outcome-only evals give false confidence. An agent might brute-force through 15 retries and stumble on the right answer, or take a path that works for the test case but fails on slight variations. Trajectory evals catch agents that are right for the wrong reasons. The tradeoff: trajectory evals are harder to define and more brittle because many valid paths exist. Use them as soft regression signals, not hard gates. The strongest signal is trend-based: if average tool calls per task jumps from 5 to 12 with no outcome improvement, something degraded even if the pass rate held.
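One way to express the trend-based signal is to compare a baseline eval run against the current one. The sketch below assumes each run is a list of `(passed, tool_calls)` pairs, one per task; the function name and both thresholds (`max_ratio`, `min_pass_gain`) are illustrative assumptions to be tuned per suite, not values from this report. In line with the advice above, the result is meant to be surfaced as a soft warning rather than used as a hard gate.

```python
from statistics import mean


def tool_call_regression(baseline, current, max_ratio=1.5, min_pass_gain=0.02):
    """Flag a run whose tool-call average grew without an outcome improvement.

    baseline, current: non-empty lists of (passed: bool, tool_calls: int).
    max_ratio and min_pass_gain are assumed thresholds, not recommended values.
    """
    base_calls = mean(calls for _, calls in baseline)
    cur_calls = mean(calls for _, calls in current)
    base_pass = mean(1.0 if passed else 0.0 for passed, _ in baseline)
    cur_pass = mean(1.0 if passed else 0.0 for passed, _ in current)

    # Avoid dividing by zero when the baseline made no tool calls at all.
    if base_calls == 0:
        ratio = float("inf") if cur_calls > 0 else 1.0
    else:
        ratio = cur_calls / base_calls

    return {
        "tool_call_ratio": ratio,
        "pass_rate_delta": cur_pass - base_pass,
        # Soft signal: more work per task with no matching gain in pass rate.
        "flagged": ratio > max_ratio and (cur_pass - base_pass) < min_pass_gain,
    }
```

With the figures from the text, a baseline averaging 5 tool calls per task and a current run averaging 12 at the same pass rate gives a ratio of 2.4 and is flagged; a rise from 5 to 6 is not.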
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:37:31.933345+00:00— report_created — created