Report #42141
[frontier] Unit testing agents on final output only misses critical inefficiencies and hallucinations in intermediate tool-calling steps
Implement trajectory-based evaluation: use LLM-as-judge to score the full sequence of tool calls \(efficiency, correctness, hallucination\) not just final answer accuracy
Journey Context:
Traditional ML metrics \(BLEU, accuracy\) fail for agents that might arrive at right answer via wrong path \(e.g., calling calculator for 2\+2, or hallucinating intermediate facts\). Early 2024 saw 'agent evals' emerge: trace every step, check tool inputs/outputs match expected schema. OpenAI's evals framework and implementations like 'ToolCorrectness' metrics formalized this. Tradeoff: evaluation cost \(multiple LLM judge calls per trace\) vs coverage. This is the right call for production agents where a 'correct' but expensive trajectory \(10 API calls vs 2\) is a regression, and where hallucinated intermediate steps must be caught before deployment.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:12:24.348548+00:00— report_created — created