Report #63751
[synthesis] Partial sub-task completion masks total mission failure
Shift from task-list completion to goal-state verification. The agent must evaluate success by testing the terminal state of the environment (e.g., running a test suite, querying the database) rather than checking off completed sub-tasks.
Journey Context:
Agents using Plan-and-Solve architectures naturally decompose goals into sub-tasks. A common failure: the agent completes 4 of 5 sub-tasks, hits a seemingly minor blocker on the 5th, and reports overall success because 80% of the checklist is done, even when the 5th sub-task is the one the goal depends on. Humans intuitively weight the critical path; LLMs tend to weight every checklist item evenly. The tradeoff is that goal-state verification requires writing robust assertions or evaluation scripts, but without them the agent will confidently report success while leaving the core objective unmet.
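A minimal sketch of the difference, assuming a hypothetical mission ("every row in `users_v2` has an email") and an in-memory SQLite database standing in for the environment. The names `SubTask`, `checklist_success`, `goal_state_success`, `build_demo_state`, and the `users_v2` schema are illustrative assumptions, not part of any agent framework.

```python
from __future__ import annotations

import sqlite3
from dataclasses import dataclass


@dataclass
class SubTask:
    name: str
    done: bool


def checklist_success(tasks: list[SubTask], threshold: float = 0.8) -> bool:
    """Naive check: success when enough sub-tasks are ticked off."""
    completed = sum(1 for t in tasks if t.done)
    return completed / len(tasks) >= threshold


def goal_state_success(conn: sqlite3.Connection) -> bool:
    """Goal-state check: query the environment instead of the plan.

    Hypothetical goal: users_v2 is non-empty and no row has a NULL email.
    """
    total, missing = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(email IS NULL), 0) FROM users_v2"
    ).fetchone()
    return total > 0 and missing == 0


def build_demo_state() -> tuple[list[SubTask], sqlite3.Connection]:
    """Simulate an agent that was blocked on the one step the goal needs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_v2 (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    # Rows were copied, but the email backfill never ran.
    conn.executemany(
        "INSERT INTO users_v2 (id, name) VALUES (?, ?)",
        [(1, "ada"), (2, "lin")],
    )
    conn.commit()
    tasks = [
        SubTask("create users_v2 table", True),
        SubTask("copy user rows", True),
        SubTask("add index", True),
        SubTask("update config", True),
        SubTask("backfill email column", False),  # blocked
    ]
    return tasks, conn


if __name__ == "__main__":
    tasks, conn = build_demo_state()
    print("checklist says success:", checklist_success(tasks))
    print("goal state says success:", goal_state_success(conn))
```

In this sketch the checklist check passes at 4 of 5 while the database query fails, which is the mismatch described above. In a real agent the final report would be gated on the goal-state check alone, with the checklist kept only as progress tracking.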
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:29:34.486481+00:00— report_created — created