Report #99988
[counterintuitive] AI-generated unit tests are a trustworthy substitute for manually written tests.
Run generated tests against mutants and known-bug regression suites. Reward tests that fail on plausible buggy variants, not tests that merely raise coverage; discard or rewrite tests that accept obviously wrong behavior.
Journey Context:
Coverage-driven test generators like Copilot, CoverAgent, and CoverUp often optimize for exercising lines rather than finding defects. Mathews & Nagappan found that these tools discard failing tests on the assumption that the code is correct and the test is wrong, so they can canonize a buggy implementation. A model-generated test that asserts 2+2=5 because it 'covers' the addition function is worse than no test. The fix is outcome-driven evaluation: use mutation testing, differential testing, and bug-seeded benchmarks to judge a test suite by what it catches, not by how many lines it covers.
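The outcome-driven check described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the tools named: the `add` function, the hand-seeded `MUTANTS`, the four candidate tests, and the `mutation_report` helper are all hypothetical names introduced here. A real setup would generate mutants with a mutation-testing tool rather than writing them by hand.

```python
from typing import Callable, Dict, List, Tuple

Impl = Callable[[int, int], int]
Test = Callable[[Impl], bool]  # a test returns True when it passes


def add(a: int, b: int) -> int:
    """Reference implementation (assumed correct for this sketch)."""
    return a + b


# Hand-seeded buggy variants; hypothetical stand-ins for tool-generated mutants.
MUTANTS: Dict[str, Impl] = {
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "off_by_one": lambda a, b: a + b + 1,
}


def coverage_only(impl: Impl) -> bool:
    impl(2, 2)  # executes the code but asserts nothing
    return True


def checks_zero(impl: Impl) -> bool:
    return impl(0, 0) == 0  # weak: 0 - 0 and 0 * 0 are also 0


def checks_sum(impl: Impl) -> bool:
    return impl(2, 3) == 5


def canonizes_bug(impl: Impl) -> bool:
    return impl(2, 2) == 5  # the "2+2=5" assertion


def mutation_report(test: Test, reference: Impl,
                    mutants: Dict[str, Impl]) -> Tuple[bool, List[str]]:
    """Return (passes_on_reference, names_of_mutants_the_test_kills).

    A test that fails on the reference is flagged for review here,
    not silently discarded: the failure may point at the code or the test.
    """
    passes_reference = test(reference)
    killed = [name for name, mutant in mutants.items() if not test(mutant)]
    return passes_reference, killed


if __name__ == "__main__":
    for test in (coverage_only, checks_zero, checks_sum, canonizes_bug):
        ok, killed = mutation_report(test, add, MUTANTS)
        print(f"{test.__name__}: passes_reference={ok}, "
              f"killed={len(killed)}/{len(MUTANTS)}")
```

In this sketch all four tests execute the same single line of `add`, so line coverage cannot tell them apart. The kill count can: `coverage_only` kills no mutants, `checks_zero` kills one, and `checks_sum` kills all three. `canonizes_bug` passes on the `off_by_one` mutant, which is the failure mode to watch for when tests are generated from the observed behavior of buggy code.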
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:24:13.071405+00:00— report_created — created