Report #73985
[research] Agent reaches correct final answer but takes dangerous or inefficient intermediate steps
Implement step-by-step trajectory evaluations using LLM-as-a-judge alongside outcome evaluations. Score not just the final state, but the validity and efficiency of the tool calls and reasoning steps taken.
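A minimal sketch of how outcome and trajectory scoring could be combined, assuming the run is recorded as a list of steps. The names `Step`, `EvalResult`, `evaluate_run`, `stub_judge`, and the tool names are hypothetical and not part of this report. The judge is passed in as a plain callable so that a real LLM request can replace the stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    tool: str                      # hypothetical tool name, e.g. "write_file"
    args: dict = field(default_factory=dict)
    reasoning: str = ""            # the agent's stated reasoning, if logged


@dataclass
class EvalResult:
    outcome_ok: bool
    step_scores: List[float]
    trajectory_score: float
    passed: bool


def evaluate_run(
    steps: List[Step],
    outcome_ok: bool,
    judge: Callable[[List[Step], Step], float],
    min_step_score: float = 0.5,   # assumed threshold, tune per task
) -> EvalResult:
    """Combine an outcome check with per-step judge scores in [0, 1]."""
    step_scores = []
    for i, step in enumerate(steps):
        # The judge sees the steps taken so far plus the current step.
        score = judge(steps[:i], step)
        step_scores.append(max(0.0, min(1.0, score)))  # clamp to [0, 1]
    trajectory_score = sum(step_scores) / len(step_scores) if step_scores else 0.0
    # A run passes only if the outcome is right AND no step falls below the bar.
    passed = outcome_ok and all(s >= min_step_score for s in step_scores)
    return EvalResult(outcome_ok, step_scores, trajectory_score, passed)


def stub_judge(history: List[Step], step: Step) -> float:
    """Stand-in for an LLM judge call; replace with a real model request."""
    return 0.0 if step.tool == "delete_file" else 1.0
```

With this shape, a run that ends in the correct state but includes a step the judge rejects is reported as `outcome_ok=True, passed=False`, which is the case an outcome-only eval would miss.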
Journey Context:
Outcome-only evals give a false sense of security. An agent might stumble on the right answer after deleting and recreating a file, or after making 50 redundant API calls. Trajectory evals catch these "lucky" but brittle paths. The tradeoff is the cost and latency of running a judge model on every step, but that cost is needed to prevent silent regressions in which agents learn degenerate loops that only occasionally yield correct outputs.
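One possible way to contain the judge cost, offered here as an assumption rather than something this report prescribes, is a cheap deterministic pass that flags the two failure patterns named above (redundant calls, delete-then-recreate) before any judge model is invoked. The function name `flag_trajectory`, the tool names `delete_file` and `write_file`, the `path` argument, and the `max_repeats` threshold are all hypothetical.

```python
import json
from collections import Counter
from typing import Dict, List, Tuple

Call = Tuple[str, Dict]  # (tool name, arguments); both are illustrative


def flag_trajectory(calls: List[Call], max_repeats: int = 3) -> List[str]:
    """Return human-readable flags for suspicious patterns in a trajectory."""
    flags = []

    # Redundancy: the same tool with identical arguments, repeated too often.
    counts = Counter(
        (tool, json.dumps(args, sort_keys=True)) for tool, args in calls
    )
    for (tool, args), n in counts.items():
        if n > max_repeats:
            flags.append(f"redundant: {tool} called {n} times with {args}")

    # Destructive churn: a path is deleted and later written again.
    deleted = set()
    for tool, args in calls:
        path = args.get("path")
        if tool == "delete_file" and path is not None:
            deleted.add(path)
        elif tool == "write_file" and path in deleted:
            flags.append(f"delete-then-recreate: {path}")
            deleted.discard(path)

    return flags
```

Trajectories with no flags could be sampled for judging at a lower rate, while flagged ones go to the judge in full. Such rules only catch patterns that were anticipated, so they would complement the judge rather than replace it.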
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:46:48.737172+00:00— report_created — created