Report #27018
[frontier] Evaluating agents only on their final output, ignoring the execution path
Evaluate agent trajectories (the sequence of tool calls and reasoning steps) using LLM-as-a-judge against a golden path, not just the final result.
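One way this could look in practice is sketched below. It is a minimal illustration, not a reference implementation: the `Step` record, the 1 to 5 scoring scale, the prompt wording, and the `call_llm` callable (any function that takes a prompt string and returns the model's text reply) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Callable, List


@dataclass
class Step:
    """One recorded step of a trajectory (hypothetical schema)."""
    tool: str                                  # name of the tool the agent called
    args: dict = field(default_factory=dict)   # arguments passed to the tool
    reasoning: str = ""                        # reasoning text logged for the step


def build_judge_prompt(task: str, golden: List[Step], actual: List[Step]) -> str:
    """Render the task, the golden path, and the actual trajectory into one prompt."""
    def render(steps: List[Step]) -> str:
        return json.dumps([asdict(s) for s in steps], indent=2)

    return (
        "You are grading an agent's execution path, not only its final answer.\n"
        f"Task: {task}\n\n"
        f"Golden path:\n{render(golden)}\n\n"
        f"Actual trajectory:\n{render(actual)}\n\n"
        'Reply with JSON only: {"score": <integer 1-5>, "rationale": "<one sentence>"}'
    )


def judge_trajectory(
    task: str,
    golden: List[Step],
    actual: List[Step],
    call_llm: Callable[[str], str],  # assumed: prompt in, raw model text out
) -> dict:
    """Ask the judge model for a verdict and validate its shape."""
    raw = call_llm(build_judge_prompt(task, golden, actual))
    verdict = json.loads(raw)  # raises if the judge did not return valid JSON
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return {"score": score, "rationale": str(verdict.get("rationale", ""))}


# Usage with a stand-in judge; a real setup would call a model here.
golden_path = [Step("search", {"q": "x"}), Step("summarize")]
observed = [Step("search", {"q": "x"}), Step("search", {"q": "x"}), Step("summarize")]
fake_llm = lambda prompt: '{"score": 3, "rationale": "Redundant repeated search."}'
verdict = judge_trajectory("Summarize x", golden_path, observed, fake_llm)
```

Keeping the model call behind a plain callable means the prompt construction and verdict parsing can be tested without network access, and the judge model can be swapped freely.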
Journey Context:
An agent can stumble onto the right answer via a terrible path (for example, 10 retries, the wrong tools, or repeated rate-limit hits). If you grade only the final answer, you miss these inefficiencies and latent bugs. Trajectory evaluation checks that the agent follows a robust, efficient process. It is especially important for detecting hallucinated tool calls that coincidentally yielded correct data.
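Some of these path problems can be flagged with cheap deterministic checks before (or alongside) any LLM judge. The sketch below is illustrative only: the function name `trajectory_flags`, the idea of a `registered_tools` set, and the use of consecutive identical calls as a proxy for retries are assumptions for the example, not part of the original report.

```python
from typing import Dict, List, Set


def trajectory_flags(
    calls: List[str],            # ordered tool names the agent actually called
    golden: List[str],           # ordered tool names on the golden path
    registered_tools: Set[str],  # tools that really exist (assumed to be known)
) -> Dict[str, object]:
    """Cheap structural checks on a trajectory, independent of the final answer."""
    # Calls to tools that were never registered: candidate hallucinated tool calls.
    unknown = sorted({c for c in calls if c not in registered_tools})

    # Consecutive identical calls, used here as a rough proxy for retries.
    retries = sum(1 for prev, cur in zip(calls, calls[1:]) if prev == cur)

    # Real tools that the golden path never uses: candidate wrong-tool usage.
    off_path = sorted(
        {c for c in calls if c in registered_tools and c not in golden}
    )

    # True if the golden steps appear in order somewhere within the calls.
    remaining = iter(calls)
    follows_golden = all(step in remaining for step in golden)

    return {
        "unknown_tools": unknown,
        "retries": retries,
        "off_path_tools": off_path,
        "follows_golden_order": follows_golden,
        "extra_steps": max(0, len(calls) - len(golden)),
    }


# Usage: the agent reaches the end of the golden path, but only after
# retrying and calling a tool that does not exist.
flags = trajectory_flags(
    calls=["search", "search", "search", "lookup_cache", "fetch_page", "summarize"],
    golden=["search", "fetch_page", "summarize"],
    registered_tools={"search", "fetch_page", "summarize", "calculator"},
)
```

In this example the trajectory still contains the golden steps in order, so a final-answer check alone would likely pass it, while the flags surface two retries and one unregistered tool call.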
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:45:01.841091+00:00— report_created — created