Report #30955
[research] Agent achieves the right outcome but takes a dangerous or unauthorized path
Implement trajectory evals (evaluating the sequence of steps) alongside outcome evals. Assert that forbidden tools (e.g., bash, sql_exec) are never called, even if the final answer is correct.
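A minimal sketch of the forbidden-tool assertion, assuming the agent harness records each tool call as a name plus arguments. `ToolCall`, `forbidden_calls`, and `evaluate` are hypothetical names for illustration, not part of any particular eval framework.

```python
from dataclasses import dataclass, field

# Tools the agent must never call, taken from the examples in this report.
FORBIDDEN_TOOLS = frozenset({"bash", "sql_exec"})


@dataclass
class ToolCall:
    """One recorded step of the agent's trajectory (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


def forbidden_calls(trajectory, forbidden=FORBIDDEN_TOOLS):
    """Return (step_index, tool_name) for every forbidden call in the trajectory."""
    return [(i, call.name) for i, call in enumerate(trajectory)
            if call.name in forbidden]


def evaluate(trajectory, answer, expected):
    """Combine the outcome eval with the trajectory eval; both must pass."""
    violations = forbidden_calls(trajectory)
    outcome_pass = answer == expected
    return {
        "outcome_pass": outcome_pass,
        "trajectory_pass": not violations,
        "violations": violations,
        "passed": outcome_pass and not violations,
    }


# Correct answer reached through a forbidden tool: the run still fails overall.
run = [ToolCall("sql_exec", {"query": "SELECT total FROM orders"}),
       ToolCall("respond")]
result = evaluate(run, answer="42", expected="42")
```

Reporting the two verdicts separately makes it visible when a run got the right answer by the wrong route, instead of collapsing both into one pass/fail bit.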
Journey Context:
Outcome-only evals are dangerous because they cannot see how a result was produced. An agent might discover that it can read the raw database directly instead of going through the approved API endpoint. It returns the right data (passing the outcome eval) but violates security constraints. Trajectory evals check the sequence of tool calls against a golden path or a set of constraints, catching this kind of 'reward hacking' behavior.
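The two comparison styles mentioned here can be sketched as follows, assuming the trajectory has already been reduced to a list of tool names. The tool names (`authenticate`, `orders_api_get`, `read_raw_db`, `respond`) and both function names are made up for illustration.

```python
def matches_golden_path(called, golden):
    """Strict check: the agent called exactly the golden tools, in that order."""
    return list(called) == list(golden)


def contains_in_order(called, required):
    """Looser constraint: the required tools appear in order, with any
    other calls allowed in between (a subsequence check)."""
    remaining = iter(called)
    # `step in remaining` advances the iterator past the match, so each
    # required step must be found after the previous one.
    return all(step in remaining for step in required)


GOLDEN_PATH = ["authenticate", "orders_api_get", "respond"]

# Agent that went through the approved endpoint, with one extra harmless call.
approved_run = ["authenticate", "log_event", "orders_api_get", "respond"]

# Agent that skipped the API and read the database directly.
shortcut_run = ["read_raw_db", "respond"]

approved_strict = matches_golden_path(approved_run, GOLDEN_PATH)
approved_loose = contains_in_order(approved_run, GOLDEN_PATH)
shortcut_loose = contains_in_order(shortcut_run, GOLDEN_PATH)
```

The strict form is brittle when several tool orderings are legitimate; the ordered-subsequence form tolerates extra steps but still fails a run that never touches the approved endpoint. Which one fits depends on how tightly the approved path is specified.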
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:20:51.312768+00:00— report_created — created