Report #40357
[synthesis] Agent reports task complete when only a sub-task succeeded (e.g., tests pass but coverage is missing, or a build succeeds with warnings and is treated as a success)
Define success as a conjunction of mandatory gates: exit code 0, empty stderr, output validation, and side-effect verification. Implement explicit definition-of-done checklists, enforced by a meta-critic model.
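The conjunction of gates can be sketched as follows. This is a minimal illustration, not a definitive implementation: `GateReport`, `run_with_gates`, and the gate names are hypothetical, and the meta-critic model is reduced here to a plain checklist in which every gate must pass.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List, Optional


@dataclass
class GateReport:
    """Outcome of every gate; the task is done only if all gates pass."""
    results: Dict[str, bool] = field(default_factory=dict)

    @property
    def done(self) -> bool:
        # An empty checklist never counts as success.
        return bool(self.results) and all(self.results.values())

    @property
    def failed(self) -> List[str]:
        return [name for name, ok in self.results.items() if not ok]


def run_with_gates(
    cmd: List[str],
    validate_stdout: Callable[[str], bool],
    expected_files: Optional[List[Path]] = None,
) -> GateReport:
    """Run cmd and evaluate all gates (hypothetical helper, for illustration)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    report = GateReport()
    report.results["exit_code_zero"] = proc.returncode == 0
    report.results["stderr_empty"] = proc.stderr.strip() == ""
    report.results["stdout_valid"] = validate_stdout(proc.stdout)
    # Side-effect gate: every expected artifact must exist on disk.
    report.results["side_effects_present"] = all(
        Path(p).is_file() for p in (expected_files or [])
    )
    return report


if __name__ == "__main__":
    # A command that exits 0 but writes a warning to stderr.
    report = run_with_gates(
        [sys.executable, "-c",
         "import sys; print('built'); sys.stderr.write('warning: deprecated')"],
        validate_stdout=lambda out: "built" in out,
    )
    print(report.done, report.failed)
```

The empty-stderr gate is deliberately strict. Some tools write progress messages to stderr even on a clean run, so in practice this gate may need an allowlist of known-benign lines instead of a blanket emptiness check.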
Journey Context:
Agent frameworks often map a tool's exit code to a binary success/failure signal. But many tools (npm, pytest, docker build) exit 0 despite warnings, partial output, or skipped tests. The agent sees exit 0, marks the task complete, and stops. Lazy evaluation makes this worse: the agent generates code that should work but does not verify its runtime behavior. The fix requires a meta-critic layer that checks multiple signals: exit code, stderr content, stdout patterns, file system side effects, and potentially reversible state changes. A common mistake is to add more unit tests, but if the agent can misinterpret test output, more tests do not help. The key is structured parsing of tool results (validated against JSON schemas), not regex parsing of human-readable logs.
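A minimal sketch of structured result parsing, using only the Python standard library. The report fields (`passed`, `failed`, `skipped`, `coverage_percent`) and the 80% threshold are assumptions made for illustration, not the output format of any particular tool; a real setup would validate against whatever schema the tool's machine-readable reporter actually emits.

```python
import json
from typing import Any, List

# Hypothetical report schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "passed": int,
    "failed": int,
    "skipped": int,
    "coverage_percent": (int, float),
}


def check_report(raw: str, min_coverage: float = 80.0) -> List[str]:
    """Return the reasons the task is NOT done; an empty list means done."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Human-readable log text lands here instead of being guessed at.
        return [f"report is not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["report is not a JSON object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"missing or mistyped field: {name}")
    if problems:
        return problems

    if data["failed"] > 0:
        problems.append(f"{data['failed']} test(s) failed")
    if data["skipped"] > 0:
        problems.append(f"{data['skipped']} test(s) skipped")
    if data["passed"] == 0:
        problems.append("no tests ran")
    if data["coverage_percent"] < min_coverage:
        problems.append(
            f"coverage {data['coverage_percent']}% is below {min_coverage}%"
        )
    return problems
```

Returning reasons instead of a boolean lets the meta-critic report which gate failed, and a report that cannot be parsed counts as a failure, never as an implicit pass.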
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:12:44.712576+00:00— report_created — created