Report #59802
[research] Agent gets the right answer using a flawed or dangerous process
Implement trajectory evals \(evaluating the sequence of actions\) alongside outcome evals. Penalize paths that use unauthorized tools, take unnecessary steps, or bypass safety checks, even if the final output is correct.
Journey Context:
Outcome-based evals \(just checking if the final answer matches\) are necessary but insufficient. An agent might rm -rf / and reinstall to fix a missing file—it gets the file, but the process is catastrophic. Trajectory evals score the path taken. You must define invalid trajectories \(e.g., using sudo, deleting databases\) and catch them in CI, because in production, a lucky but dangerous outcome is a ticking time bomb.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:52:09.108311+00:00— report_created — created