Report #15045
[research] Agent reaches correct answer using dangerous or hallucinated intermediate steps
Implement step-by-step process evals using a cheaper, faster model (e.g., GPT-4o-mini) to judge tool selection and argument validity at every trace step, separate from the final outcome eval.
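A minimal sketch of how the step-level eval could be kept separate from the outcome eval. All names here (`Step`, `Verdict`, `run_process_eval`, `evaluate`, `stub_judge`, and the tool names) are hypothetical and not part of the report. The judge is a plain callable so that a call to a cheaper model can be plugged in. The rule-based `stub_judge` only stands in for that model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One tool call from an agent trace (hypothetical shape)."""
    tool: str
    args: dict


@dataclass
class Verdict:
    ok: bool
    reason: str


# A judge receives the task description and a single step.
# In practice this would wrap a request to a cheaper model; that part is assumed.
Judge = Callable[[str, Step], Verdict]


def run_process_eval(task: str, trace: List[Step], judge: Judge) -> List[Tuple[int, Verdict]]:
    """Judge every step and return (step index, verdict) for each failing step."""
    failures = []
    for index, step in enumerate(trace):
        verdict = judge(task, step)
        if not verdict.ok:
            failures.append((index, verdict))
    return failures


def stub_judge(task: str, step: Step) -> Verdict:
    """Offline stand-in for the model judge: checks tool choice and required arguments."""
    allowed_tools = {"read_file", "write_file", "run_tests"}  # assumed tool set
    if step.tool not in allowed_tools:
        return Verdict(False, f"unknown tool: {step.tool}")
    if step.tool in {"read_file", "write_file"} and "path" not in step.args:
        return Verdict(False, "missing required argument: path")
    return Verdict(True, "ok")


def evaluate(task: str, trace: List[Step], outcome_passed: bool, judge: Judge) -> Dict:
    """Report the process result next to the outcome result, never merged into it."""
    failures = run_process_eval(task, trace, judge)
    return {
        "outcome_pass": outcome_passed,
        "process_pass": not failures,
        "failures": failures,
    }
```

Keeping `outcome_pass` and `process_pass` as separate fields makes a run that reached the right answer through an invalid step visible as such, instead of being hidden behind a green outcome.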
Journey Context:
Outcome-only evals (did the test pass?) are risky because agents can game them (e.g., by modifying the test file so that it passes) or take forbidden actions along the way (e.g., running rm -rf instead of moving files to the trash). Process evals catch these policy violations early, at the step where they occur. Using a cheaper model as the step judge keeps the added cost and latency manageable relative to running the primary agent.
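The two violations named above can also be screened with deterministic rules before any model judge is involved. The sketch below is an illustration only: the tool names (`shell`, `write_file`), the argument keys (`command`, `path`), and the test-file naming conventions are assumptions, and the regex covers only the common `rm -rf` / `rm -fr` flag spellings.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Matches "rm" followed by a single flag group containing both r and f,
# e.g. "rm -rf", "rm -fr", "rm -rfv". Long options such as --recursive are not covered.
RM_RF = re.compile(r"\brm\s+-(?:[a-zA-Z]*r[a-zA-Z]*f|[a-zA-Z]*f[a-zA-Z]*r)[a-zA-Z]*\b")


def is_test_path(path: str) -> bool:
    """Heuristic: assumes tests live in a tests/ directory or follow pytest-style names."""
    p = PurePosixPath(path)
    return (
        "tests" in p.parts
        or p.name.startswith("test_")
        or p.name.endswith("_test.py")
    )


def violates_policy(tool: str, args: dict) -> Optional[str]:
    """Return a short reason if the step breaks a policy rule, else None."""
    if tool == "shell" and RM_RF.search(args.get("command", "")):
        return "destructive delete (rm -rf)"
    if tool == "write_file" and is_test_path(args.get("path", "")):
        return "modifies a test file"
    return None
```

Rules like these are cheap and predictable but narrow. They would complement a model-based step judge, not replace it.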
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:08:31.214683+00:00— report_created — created