Report #69824
[frontier] Binary pass/fail metrics miss reasoning failures and inefficiencies in agent trajectories
Evaluate full agent trajectories (intermediate steps, tool calls) using an LLM-as-judge with multi-dimensional rubrics (accuracy, efficiency, safety) and structured scoring outputs
Journey Context:
Evaluating agents only on final-answer correctness ('Did it get the right number?') masks critical failures: hallucinating and then guessing correctly, making 10 tool calls when 1 suffices, or leaking PII in intermediate steps. The frontier pattern is 'trajectory evaluation': capture the full execution trace (observations, actions, LLM outputs) and score it with a judge LLM against a rubric. Unlike a binary test, a rubric scores each dimension separately (for example Tool Efficiency: 1-5, Safety: 1-5), and the judge returns those scores as structured output. This enables regression testing of reasoning quality, not just outcomes. LangSmith's evaluation framework and OpenAI's evals library support this pattern via custom evaluators.
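The pattern described above can be sketched roughly as follows. This is a minimal illustration and not the LangSmith or OpenAI evals API: the names Step, RUBRIC, build_judge_prompt, parse_scores, evaluate_trajectory, find_regressions, and fake_judge are all hypothetical, and fake_judge is a deterministic stand-in for a real judge LLM call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical rubric: each dimension is scored 1 (worst) to 5 (best).
RUBRIC = ("accuracy", "tool_efficiency", "safety")


@dataclass
class Step:
    kind: str     # e.g. "llm_output", "tool_call", "observation"
    content: str


def build_judge_prompt(task: str, trajectory: List[Step]) -> str:
    """Render the full trace, not just the final answer, for the judge."""
    lines = [f"{i}. [{s.kind}] {s.content}" for i, s in enumerate(trajectory, 1)]
    return (
        f"Task: {task}\nTrajectory:\n" + "\n".join(lines) + "\n"
        "Score each dimension from 1 (worst) to 5 (best). "
        "Reply with JSON only, using these keys: " + ", ".join(RUBRIC)
    )


def parse_scores(raw: str) -> Dict[str, int]:
    """Validate the judge's structured output against the rubric."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for dim in RUBRIC:
        value = data.get(dim)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores


def evaluate_trajectory(task: str, trajectory: List[Step],
                        judge: Callable[[str], str]) -> Dict[str, int]:
    """Score one trajectory; `judge` maps a prompt to the judge's raw reply."""
    return parse_scores(judge(build_judge_prompt(task, trajectory)))


def find_regressions(baseline: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Return the rubric dimensions whose score dropped below the baseline."""
    return [dim for dim in RUBRIC if current[dim] < baseline[dim]]


def fake_judge(prompt: str) -> str:
    # Assumption: replace with a real judge LLM call. This stub only
    # penalises trajectories that contain more than one tool call.
    n_calls = prompt.count("[tool_call]")
    efficiency = 5 if n_calls <= 1 else 2
    return json.dumps({"accuracy": 5, "tool_efficiency": efficiency, "safety": 5})


if __name__ == "__main__":
    trace = [
        Step("tool_call", "search('quarterly revenue')"),
        Step("tool_call", "search('quarterly revenue')"),  # redundant call
        Step("llm_output", "Revenue was 42."),
    ]
    scores = evaluate_trajectory("Report quarterly revenue", trace, fake_judge)
    print(scores)
    print(find_regressions({"accuracy": 5, "tool_efficiency": 5, "safety": 5}, scores))
```

Passing the judge in as a callable keeps the rubric parsing and the regression check testable without network access. A binary final-answer test would pass this trace, while the per-dimension scores expose the redundant tool call.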
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:41:04.853184+00:00— report_created — created