Report #2251
[research] Final-outcome evals miss the reasoning errors that accidentally lead to correct answers
Implement step-by-step LLM-as-a-judge evals on the trace spans. Prompt the judge model to evaluate whether the tool chosen and the parameters passed were optimal given the context available at that specific turn, independent of the final result.
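A minimal sketch of what per-span judging could look like, assuming each span has already been exported from the trace into a simple record. The `Span` schema, the prompt wording, the verdict fields, and `stub_judge` are all hypothetical; in practice `judge` would wrap whatever model client is in use.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Span:
    """One tool-call step in an agent trace (hypothetical schema)."""
    turn: int
    context: str   # what the agent knew at this turn
    tool: str      # tool the agent chose
    params: dict   # parameters it passed


# Assumed prompt wording; the key point is that the final result is excluded.
JUDGE_TEMPLATE = (
    "You are reviewing one step of an agent trace.\n"
    "Judge ONLY this step; do not consider the final result.\n"
    "Context at turn {turn}:\n{context}\n"
    "Tool chosen: {tool}\n"
    "Parameters: {params}\n"
    'Reply with JSON: {{"tool_ok": bool, "params_ok": bool, "reason": str}}'
)


def build_judge_prompt(span: Span) -> str:
    """Render the judge prompt for a single span."""
    return JUDGE_TEMPLATE.format(
        turn=span.turn,
        context=span.context,
        tool=span.tool,
        params=json.dumps(span.params, sort_keys=True),
    )


def judge_trace(spans: list[Span], judge: Callable[[str], str]) -> list[dict]:
    """Ask the judge about every span and collect one verdict per turn."""
    verdicts = []
    for span in spans:
        verdict = json.loads(judge(build_judge_prompt(span)))
        verdict["turn"] = span.turn
        verdicts.append(verdict)
    return verdicts


def stub_judge(prompt: str) -> str:
    """Placeholder for a real judge-model call; always approves."""
    return json.dumps({"tool_ok": True, "params_ok": True, "reason": "stub"})
```

Calling `judge_trace(spans, stub_judge)` returns one verdict dict per span, so a trace can be failed on any turn where `tool_ok` or `params_ok` is false even when the final answer was correct.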
Journey Context:
If an agent guesses the right answer via a flawed path (e.g., skipping a validation step), a final-outcome eval gives a false positive. This masks dangerous behaviors that will fail in production edge cases. Step-by-step trajectory evals are expensive but necessary for high-stakes agentic workflows where the process must be compliant, not just the outcome.
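The false positive can be illustrated with a toy comparison. This is a deterministic stand-in for the judge, not the LLM-based eval itself, and the step names in `REQUIRED_STEPS` are hypothetical:

```python
# Hypothetical compliance policy: tools a valid trajectory must call.
REQUIRED_STEPS = ["fetch_record", "validate_record", "submit"]


def outcome_eval(final_answer: str, expected: str) -> bool:
    """Final-outcome eval: looks only at the answer."""
    return final_answer == expected


def process_eval(tools_called: list[str], required: list[str] = REQUIRED_STEPS) -> list[str]:
    """Trajectory check: return the required steps the agent never called."""
    return [step for step in required if step not in tools_called]


# The agent skipped validation but still landed on the right answer.
tools_called = ["fetch_record", "submit"]
passed_outcome = outcome_eval("approved", "approved")
missing_steps = process_eval(tools_called)

print(passed_outcome)  # True
print(missing_steps)   # ['validate_record']
```

The outcome check passes while the trajectory check reports the skipped step, which is the gap a step-level eval is meant to close.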
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:31:57.601147+00:00 - report_created - created