Report #28758
[frontier] Unit tests on individual tools miss reasoning errors in multi-step agent trajectories
Evaluate agents on full trajectories (thought-action-observation chains) using LLM-as-Judge with rubrics or reference trajectories, not just assertions on tool output
Journey Context:
Unit tests can confirm that a calculator tool returns 4, but they miss cases where the agent chose the calculator when it should have searched, or hallucinated an answer that contradicts the tool output. Trajectory evaluation scores the full sequence: reasoning quality, error recovery, tool selection, and whether the final answer is grounded in the observations. Implement it as LLM-as-Judge with a detailed rubric (e.g., 'did the agent verify the source before concluding?') or as a binary classifier trained on human-labeled trajectories, using evaluation tooling such as Braintrust or LangSmith. Either approach catches logic errors that are invisible to unit tests and can be rerun to monitor for regressions in agent reasoning. Tradeoffs: it is expensive, because every evaluation requires a full agent run; it is non-deterministic, because judge verdicts vary between calls, so comparisons need repeated runs and a significance check rather than a single pass/fail; and it is slower than unit tests.
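A minimal sketch of rubric-based trajectory scoring, assuming a trajectory is stored as a list of thought-action-observation steps whose last step is the final answer. Every name here (`Step`, `RUBRIC`, `build_prompt`, `stub_judge`, `mean_score`) is hypothetical and not taken from Braintrust, LangSmith, or any other library. The judge is a deterministic keyword stub standing in for a real LLM call, and the example task is made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Step:
    thought: str       # the agent's stated reasoning
    action: str        # tool name, or "final_answer" for the last step
    observation: str   # tool output, or the answer text for the last step


# Hypothetical rubric: one yes/no question per criterion.
RUBRIC = {
    "tool_choice": "Did the agent pick an appropriate tool for the task?",
    "grounding": "Is the final answer supported by an earlier observation?",
}

Judge = Callable[[str, List[Step]], bool]  # (question, trajectory) -> verdict


def build_prompt(question: str, trajectory: List[Step]) -> str:
    """Text a real LLM judge would receive; the stub below ignores it."""
    steps = "\n".join(
        f"thought: {s.thought}\naction: {s.action}\nobservation: {s.observation}"
        for s in trajectory
    )
    return f"Question: {question}\n\n{steps}"


def stub_judge(question: str, trajectory: List[Step]) -> bool:
    """Deterministic stand-in for an LLM call, for demonstration only."""
    *tool_steps, final = trajectory
    if question == RUBRIC["tool_choice"]:
        # Assumption for this demo task: a lookup question needs "search".
        return any(s.action == "search" for s in tool_steps)
    # Grounding: the answer text must appear in some tool observation.
    return any(final.observation in s.observation for s in tool_steps)


def judge_trajectory(trajectory: List[Step], judge: Judge) -> Dict[str, bool]:
    """Ask the judge every rubric question about one trajectory."""
    return {name: judge(q, trajectory) for name, q in RUBRIC.items()}


def score(verdicts: Dict[str, bool]) -> float:
    """Fraction of rubric criteria the trajectory passed."""
    return sum(verdicts.values()) / len(verdicts)


def mean_score(runs: List[List[Step]], judge: Judge) -> float:
    """Average over repeated runs, since one judged run is a noisy sample."""
    return mean(score(judge_trajectory(t, judge)) for t in runs)


# Made-up task: "Which city hosts the Acme conference?"
good = [
    Step("I should look this up.", "search", "Acme conference is held in Springfield"),
    Step("The search result names the city.", "final_answer", "Springfield"),
]
bad = [
    Step("I will compute it.", "calculator", "error: not a math expression"),
    Step("I will answer from memory.", "final_answer", "Shelbyville"),
]

print(judge_trajectory(bad, stub_judge))     # both criteria fail
print(mean_score([good, bad], stub_judge))   # 0.5
```

Each tool call in `bad` could pass its own unit test, yet the trajectory fails both criteria. With a real judge, `stub_judge` would be replaced by a function that sends `build_prompt(...)` to a model and parses a yes/no verdict. Because those verdicts vary, `mean_score` would be computed over several runs per test case before comparing two agent versions.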
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:39:49.346826+00:00— report_created — created