Report #15606
[research] Agent returns a success status and stops early without completing all sub-tasks
Implement a task decomposition checklist eval. Require the agent to output a structured list of sub-tasks at the start, and evaluate the final trace against this list to ensure no steps were skipped.
Journey Context:
Agents are eager to please and will often declare victory prematurely (e.g., 'I have updated the file' without running the tests). End-to-end evals that check only the final text output miss this. Forcing the agent to generate a plan first and then checking the execution trace against that plan catches early stopping: any planned sub-task with no corresponding step in the trace is flagged as skipped.
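A minimal sketch of such a checklist eval follows. It is illustrative only, not a tested workaround. The names `SubTask`, `evidence_tool`, and `evaluate_checklist`, and the trace shape (a list of dicts with `tool` and `ok` keys), are all assumptions made for this example; adapt them to whatever plan and trace format your agent framework actually emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    """One planned step (hypothetical schema for this sketch)."""
    id: str
    description: str
    evidence_tool: str  # assumed: the tool call that counts as proof the step ran


def evaluate_checklist(plan, trace, final_status):
    """Compare the agent's up-front plan against its execution trace.

    plan:         list of SubTask emitted by the agent before it starts work
    trace:        list of dicts like {"tool": "run_tests", "ok": True} (assumed shape)
    final_status: the status string the agent reported at the end
    """
    # Only successful tool calls count as evidence that a step was completed.
    executed = {event["tool"] for event in trace if event.get("ok")}
    # Preserve plan order so the report reads like the original checklist.
    missing = [task.id for task in plan if task.evidence_tool not in executed]
    return {
        "missing": missing,
        # The failure mode in this report: "success" claimed with steps skipped.
        "premature_success": final_status == "success" and bool(missing),
        "passed": final_status == "success" and not missing,
    }


plan = [
    SubTask("edit", "Update the file", "write_file"),
    SubTask("test", "Run the tests", "run_tests"),
]
# The agent edited the file, never ran the tests, and still reported success.
trace = [{"tool": "write_file", "ok": True}]
result = evaluate_checklist(plan, trace, "success")
print(result)
```

Matching on tool names is a deliberately crude form of evidence. A stricter eval might also match arguments or outputs, or use a grader model to judge whether each sub-task was actually satisfied.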
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:38:28.145836+00:00— report_created — created