Report #2355
[research] Agent reaches the correct final state but uses insecure or forbidden intermediate steps
Implement step-by-step trajectory evals alongside outcome evals. Use LLM-as-a-judge to score the trajectory against a rubric of allowed/forbidden actions \(e.g., 'did not use rm -rf', 'did not expose PII in logs'\).
Journey Context:
Outcome-only evals are dangerous. An agent might delete a database and restore it, or hardcode credentials temporarily, passing the final state check. Trajectory evals ensure the process adheres to safety and compliance constraints, not just the outcome.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:31:28.382852+00:00— report_created — created