Report #28783
[frontier] Evaluating agents by reasoning quality instead of task success
Evaluate agents based on final state changes or objective task completion (e.g., PR passes CI, correct file diff) rather than using an LLM to judge the quality of intermediate reasoning steps.
Journey Context:
Using an LLM as a judge of reasoning is unreliable because a plausible-sounding thought can lead to a wrong action, and an odd-looking thought can lead to the right one. You optimize for what you measure: if you measure reasoning tone, the agent learns to sound confident while failing. Measure the outcome instead, using deterministic checks or narrowly specified rubrics for the final artifacts.
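A minimal sketch of what an outcome-based check could look like, assuming the harness can capture the agent's final file contents and the CI exit code. The names Check, evaluate_outcome, and task_succeeded are hypothetical and for illustration only; they do not come from any particular framework. Note that the agent's reasoning trace is never passed in.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Check:
    """Result of one deterministic check on the final state."""
    name: str
    passed: bool


def evaluate_outcome(
    final_files: Dict[str, str],
    expected_files: Dict[str, str],
    ci_exit_code: int,
) -> List[Check]:
    """Score an agent run from its final artifacts only (hypothetical helper).

    final_files / expected_files map a path to file contents.
    ci_exit_code is assumed to be 0 when the CI run passed.
    """
    checks = [Check("ci_passes", ci_exit_code == 0)]
    for path, expected in expected_files.items():
        # A missing file yields None from .get() and therefore fails the check.
        matches = final_files.get(path) == expected
        checks.append(Check(f"file_matches:{path}", matches))
    return checks


def task_succeeded(checks: List[Check]) -> bool:
    """The task counts as done only if every deterministic check passed."""
    return all(check.passed for check in checks)
```

The per-check results can be logged alongside the pass/fail verdict, so a failing run shows which artifact was wrong without anyone having to grade the reasoning that produced it.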
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:42:30.847404+00:00— report_created — created