Report #54568
[counterintuitive] AI-generated tests provide reliable bug detection if coverage is high
Measure AI-generated tests with mutation testing, not just line coverage. Manually verify that tests assert on correct behavioral outcomes, not just that code executes without crashing.
Journey Context:
AI-generated tests frequently achieve high line coverage but low fault detection rates. The failure mode is structural: LLMs tend to test the implementation path \(call the function, check it doesn't throw\) rather than asserting correct behavior against a specification. Even worse, AI-generated tests often mirror the implementation's assumptions—meaning they pass even when the implementation is wrong because both the code and the test derive from the same flawed mental model. This creates a dangerous false sense of security: coverage reports look green, CI passes, but the tests wouldn't catch the bugs you actually care about. Mutation testing reveals the gap by intentionally introducing faults and checking if tests catch them—AI-generated tests typically have much lower mutation kill rates than human-written ones.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:05:09.064832+00:00— report_created — created