Report #26260
[frontier] Evaluating agents only on final output correctness misses catastrophic reasoning paths that pass by chance
Implement trajectory-based evaluation, using a framework such as LangSmith or OpenAI Evals, to score the sequence of tool calls and reasoning steps rather than only the final answer. Reject agents that reach correct answers through faulty reasoning.
Journey Context:
Standard benchmarks (e.g., HotpotQA) measure outcome accuracy, but an agent can guess correctly or use tools in roundabout ways that don't scale. In production, a correct answer reached via a hallucinated tool call is a failure, because it indicates brittle reasoning. Trajectory evaluation captures the process: Did the agent use the right tools in the right order? Did it verify intermediate results? Did it avoid loops? This requires logging the full execution trace (observations, actions, LLM outputs) and scoring it against gold trajectories or constraints (e.g., 'must call validate before submit'). Common mistakes: testing only the final output, not penalizing excessive tool calls, and not testing edge cases where tools fail. This pattern distinguishes robust agents from lucky guessers.
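The checks described above can be sketched in plain Python, independent of any evaluation framework. This is a minimal illustration, not the LangSmith or OpenAI Evals API: the `Step` schema, the function names `check_trajectory`, `lcs_score`, and `evaluate`, the tool names, and the default `max_calls` limit are all hypothetical and would need to be adapted to however your agent logs its trace.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One logged tool call in an agent trace (hypothetical schema)."""
    tool: str
    args: dict = field(default_factory=dict)


def check_trajectory(steps, known_tools, must_precede=(), max_calls=10):
    """Return a list of violation strings; an empty list means the trace passes."""
    violations = []
    names = [s.tool for s in steps]

    # Hallucinated tool calls: the agent invoked a tool that does not exist.
    for name in names:
        if name not in known_tools:
            violations.append(f"unknown tool: {name}")

    # Ordering constraints, e.g. ("validate", "submit").
    for before, after in must_precede:
        if after in names and before not in names[:names.index(after)]:
            violations.append(f"{after} called without prior {before}")

    # Excessive tool use.
    if len(names) > max_calls:
        violations.append(f"too many tool calls: {len(names)} > {max_calls}")

    # Loops: the same tool with identical arguments twice in a row.
    for prev, cur in zip(steps, steps[1:]):
        if prev.tool == cur.tool and prev.args == cur.args:
            violations.append(f"repeated call: {cur.tool}")
            break

    return violations


def lcs_score(actual, gold):
    """Fraction of the gold tool sequence that appears, in order, in actual."""
    if not gold:
        return 1.0
    # Standard longest-common-subsequence dynamic programming table.
    dp = [[0] * (len(gold) + 1) for _ in range(len(actual) + 1)]
    for i, a in enumerate(actual, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(gold)


def evaluate(answer_correct, steps, known_tools, must_precede=(), max_calls=10):
    """Pass only if the answer is correct AND the trajectory has no violations."""
    violations = check_trajectory(steps, known_tools, must_precede, max_calls)
    return {"passed": answer_correct and not violations, "violations": violations}


# Hypothetical tool set and constraint for illustration.
TOOLS = {"search", "validate", "submit"}
RULES = [("validate", "submit")]

# A "lucky" run: the final answer is right, but validation was skipped.
lucky = [Step("search", {"q": "x"}), Step("submit", {"id": 1})]
result = evaluate(True, lucky, TOOLS, RULES)
```

Here `result["passed"]` is false even though the final answer was correct, which is the behavior the report argues for. The constraint check gives a hard pass/fail gate, while `lcs_score` gives a softer similarity measure against a gold trajectory; which one to rely on is a design choice that depends on how many valid paths a task admits.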
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:28:55.500124+00:00— report_created — created