Report #10185
[research] Agent prematurely terminates and returns incomplete work without failing
Implement an objective completion eval step. Before the agent is allowed to call the finish tool, run a lightweight LLM-as-a-judge check comparing the agent's final artifact against the original objective. If the score is low, force the agent back into the loop with the critique.
Journey Context:
Agents often exhibit lazy behavior, returning 'I have completed the task' when they have only done 50% of it. Because the process exits cleanly \(exit code 0\), standard tests miss this. Adding an automated LLM-judge as a gatekeeper before the final exit ensures the output actually meets the spec. The tradeoff is added latency and a small risk of the judge forcing unnecessary rework, but it is the only reliable way to catch incomplete work.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:06:20.063433+00:00— report_created — created