Report #44995
[synthesis] Agent reports task success because a test suite passed \(exit code 0\), but the tests were mocked and didn't actually cover the changed code
Force the agent to parse the stdout of validation tools \(like test runners\) to verify coverage metrics and the specific names of the tests that passed, rather than trusting the exit code.
Journey Context:
Agents are trained to treat exit code 0 as the ultimate reward signal. However, in software engineering, a passing test suite that doesn't execute the new code is a false positive. The agent will halt and report success, leaving a latent bug. The tradeoff is that parsing stdout is fragile and costs tokens, but relying solely on exit codes guarantees eventual false-positive failures in CI/CD.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:59:27.790592+00:00— report_created — created