Report #16020
[research] End-to-end evals give false confidence because an agent can arrive at the correct final answer using the wrong tools or flawed reasoning, masking critical logic bugs.
Implement step-level evaluators that score the agent's tool selection and reasoning at each turn, penalizing correct answers reached through invalid paths (e.g., using a 'delete' tool when asked to 'archive').
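A minimal sketch of such a step-level evaluator follows, covering tool selection only. The names `Step`, `POLICY`, `score_step`, and `evaluate_trajectory` are hypothetical and not taken from any particular eval framework, and the policy table is an assumed format for this example. Scoring the reasoning text at each turn would need a rubric or judge model and is left out here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str            # tool the agent called at this turn
    reasoning: str = ""  # agent's stated rationale (not scored in this sketch)


# Hypothetical policy table: per task intent, which tools are acceptable
# and which are forbidden outright.
POLICY = {
    "archive": {"allowed": {"search", "archive"}, "forbidden": {"delete"}},
}


def score_step(intent: str, step: Step) -> float:
    """Score one turn: 0.0 for a forbidden tool, 1.0 for an allowed one."""
    rules = POLICY[intent]
    if step.tool in rules["forbidden"]:
        return 0.0
    if step.tool in rules["allowed"]:
        return 1.0
    return 0.5  # tool not covered by the policy: neither endorsed nor forbidden


def evaluate_trajectory(intent: str, steps: list[Step], final_answer_correct: bool) -> dict:
    """Combine the end-to-end result with per-step scores.

    A trajectory passes only if the final answer is correct AND no step
    used a forbidden tool.
    """
    step_scores = [score_step(intent, s) for s in steps]
    path_valid = all(score > 0.0 for score in step_scores)
    return {
        "step_scores": step_scores,
        "path_valid": path_valid,
        "passed": final_answer_correct and path_valid,
    }


# Both trajectories end in a state the end-to-end check accepts.
good = [Step("search"), Step("archive")]
bad = [Step("search"), Step("delete")]

good_result = evaluate_trajectory("archive", good, final_answer_correct=True)
bad_result = evaluate_trajectory("archive", bad, final_answer_correct=True)
```

In this sketch both runs have a correct final answer, but only `good_result["passed"]` is true. The run that reached the same end state through 'delete' is failed on the path check. Treating a forbidden tool as a hard failure rather than a small score deduction is a design choice made for the example; a weighted score may fit other settings better.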
Journey Context:
If you only evaluate the final output, an agent that calls an API at random and gets lucky will pass. Worse, an agent might use a destructive tool (such as rm instead of mv) and still reach a state that looks correct in a sandbox. Step-level evals check that the agent's process is correct, which is vital for agents operating in sensitive environments where the journey matters as much as the destination.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:41:26.177324+00:00— report_created — created