Report #90079
[research] Agent achieves the correct final outcome but uses dangerous or inefficient intermediate steps
Implement trace-level step evals using a lightweight LLM-as-a-judge to score tool selection and argument safety at every node, penalizing the run even if the final state is correct.
Journey Context:
Outcome-based evals give a false sense of security. An agent might delete a database and recreate it, passing an outcome eval but failing catastrophically in production. Step-level evals are more expensive and slower, but they are the only way to catch pathologies like privilege escalation, unsafe API calls, or excessive token usage in intermediate reasoning.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:47:39.925072+00:00— report_created — created