Report #49547
[research] No regression eval suite for agent trajectories — only final output checked
Build a golden dataset of (input, expected_trajectory) pairs. Each trajectory must record tool selections, tool arguments, decision points, and intermediate reasoning, not just the final output. Run the suite on every change (prompt, model, tool, dependency). Use LLM-as-judge for trajectory comparison, with a rubric that scores correct tool selection, correct argument formulation, appropriate error recovery, and final output quality. Weight trajectory correctness separately from output correctness.
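A minimal sketch of how such a suite could be structured, assuming Python 3.9+. Every name here (`Step`, `GoldenCase`, `evaluate`, `passes`, `stub_judge`) is hypothetical and not taken from any existing framework. The judge is modeled as a plain callable that returns one score in [0, 1] per rubric dimension; `stub_judge` is a deterministic stand-in for a real LLM-as-judge call, included only so the example runs.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    """One recorded agent action (hypothetical schema)."""
    tool: str
    args: dict[str, Any]
    reasoning: str = ""


@dataclass
class GoldenCase:
    """One (input, expected_trajectory) pair plus the expected final output."""
    input: str
    expected_trajectory: list[Step]
    expected_output: str


@dataclass
class EvalResult:
    trajectory_score: float
    output_score: float


# Trajectory-level rubric dimensions from the report; output quality is kept apart.
TRAJECTORY_RUBRIC = ("tool_selection", "argument_formulation", "error_recovery")

# A judge maps (case, actual steps, actual output) to a score per rubric dimension.
Judge = Callable[[GoldenCase, list[Step], str], dict[str, float]]


def stub_judge(case: GoldenCase, steps: list[Step], output: str) -> dict[str, float]:
    """Deterministic stand-in for an LLM judge: positional exact match only."""
    pairs = list(zip(case.expected_trajectory, steps))
    n = max(len(case.expected_trajectory), len(steps), 1)
    return {
        "tool_selection": sum(e.tool == a.tool for e, a in pairs) / n,
        "argument_formulation": sum(e.args == a.args for e, a in pairs) / n,
        "error_recovery": 1.0,  # not assessed by this stub
        "output_quality": float(output == case.expected_output),
    }


def evaluate(case: GoldenCase, steps: list[Step], output: str, judge: Judge) -> EvalResult:
    """Average the trajectory dimensions; report output quality as its own score."""
    scores = judge(case, steps, output)
    trajectory = sum(scores[k] for k in TRAJECTORY_RUBRIC) / len(TRAJECTORY_RUBRIC)
    return EvalResult(trajectory_score=trajectory, output_score=scores["output_quality"])


def passes(result: EvalResult, min_trajectory: float = 0.8, min_output: float = 0.8) -> bool:
    """Gate on each score independently so a good output cannot mask a bad path."""
    return result.trajectory_score >= min_trajectory and result.output_score >= min_output
```

Keeping the two scores in separate fields, rather than folding them into one number, is one way to weight trajectory correctness separately from output correctness. The 0.8 thresholds are placeholders, not recommended values.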
Journey Context:
Most agent evals check only the final output. But an agent can reach a correct output via a wrong path: using a search tool when it should have read a file, making 5 API calls when 1 would suffice, or recovering by luck from an unnecessary error. Output-only evals miss these efficiency, robustness, and cost regressions; trajectory-level evals catch them. The tradeoff is cost: trajectory evals are more expensive to run and maintain, and the golden dataset requires curation. In exchange, they catch regressions that output-only evals fundamentally cannot, especially in cost and reliability.
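To illustrate the wrong-path cases above, here is a rough sketch of a cheap deterministic check that could run before any LLM judge. It assumes a trajectory can be reduced to an ordered list of tool names; the function name `trajectory_regressions` and the tool names are made up for the example.

```python
from collections import Counter


def trajectory_regressions(expected_tools: list[str], actual_tools: list[str]) -> list[str]:
    """Flag path-level differences that an output-only check would not see."""
    expected, actual = Counter(expected_tools), Counter(actual_tools)
    findings = []
    # Tools the agent used that the golden trajectory never calls.
    for tool in sorted(actual.keys() - expected.keys()):
        findings.append(f"unexpected tool: {tool}")
    # Tools the golden trajectory calls that the agent skipped.
    for tool in sorted(expected.keys() - actual.keys()):
        findings.append(f"missing tool: {tool}")
    # More calls than the golden trajectory needed (efficiency / cost signal).
    extra = len(actual_tools) - len(expected_tools)
    if extra > 0:
        findings.append(f"{extra} extra call(s)")
    return findings


# Same final output in both runs, but the path differs from the golden one.
print(trajectory_regressions(["read_file"], ["search"]))
print(trajectory_regressions(["call_api"], ["call_api"] * 5))
```

This check ignores call order and arguments, so it only covers the coarse cases (wrong tool, too many calls). Judging error recovery or argument quality would still need the rubric-based comparison.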
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:38:35.642119+00:00— report_created — created