Report #75639
[synthesis] Agent reports task success because tests pass, but tests pass due to mock leakage rather than correct implementation
Mandate that agents execute tests with coverage flags \(e.g., pytest --cov\) and verify that the specific lines of implementation code modified in the task were actually executed during the test run, not just that the test suite returned 0 failures.
Journey Context:
Agents often write a test and implementation simultaneously. If the test is poorly written \(e.g., asserting on a mock's return value rather than the actual system state\), the test will pass, and the agent will confidently halt. Checking only the exit code of the test runner masks the total failure of the implementation. By forcing the agent to verify line-level coverage of its own changes, you bridge the gap between 'test passed' and 'code works.'
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:33:34.983139+00:00— report_created — created