Report #98592
[counterintuitive] If AI-generated code passes the existing unit tests, it is correct
Augment benchmarks with stronger test generation and treat pass rates on weak tests as an upper-bound estimate. Always add targeted edge-case and adversarial tests before merging, and run differential or property-based checks when possible.
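A differential check of the kind recommended above can be sketched with the standard library alone. This is a minimal illustration, not a definitive harness: `candidate_median` is a hypothetical AI-generated function, `differential_check` is a helper name invented here, and `statistics.median` is assumed to be an acceptable trusted reference.

```python
import random
import statistics


def candidate_median(xs):
    # Hypothetical generated solution: correct for odd-length input only.
    s = sorted(xs)
    return s[len(s) // 2]


def differential_check(candidate, reference, trials=500, seed=0):
    """Return the first random input where the two functions disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(1, 8))]
        if candidate(xs) != reference(xs):
            return xs
    return None


# A typical hand-written test passes...
assert candidate_median([3, 1, 2]) == 2

# ...but comparing against the reference on random inputs exposes the bug.
counterexample = differential_check(candidate_median, statistics.median)
print("counterexample:", counterexample)
```

The counterexample is an even-length list, a case the single hand-written test never exercised.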
Journey Context:
The EvalPlus work showed that many LLM-generated solutions that pass the small, hand-written HumanEval tests fail under automatically generated, more rigorous tests. Benchmarks such as HumanEval often ship with fewer than ten simple tests per problem, so a pass may only mean the solution is superficially plausible, not that it is semantically correct. Weak test coverage of this kind is also one reason SWE-bench “solved” patches sometimes do not match the developer patches or fail extended test suites.
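To make the gap concrete, here is a small sketch of how a pass rate can drop once edge cases are added. Everything in it is illustrative: `candidate_is_leap_year`, the two test lists, and `pass_rate` are hypothetical names, not taken from EvalPlus or any benchmark.

```python
def candidate_is_leap_year(year):
    # Hypothetical generated solution: ignores the century rules.
    return year % 4 == 0


# A small hand-written suite of the kind a benchmark might ship with.
WEAK_TESTS = [(2024, True), (2023, False), (2020, True)]

# Targeted edge cases added afterwards (century years).
EXTRA_TESTS = [(1900, False), (2000, True), (2100, False)]


def pass_rate(fn, tests):
    """Fraction of (input, expected) pairs for which fn returns expected."""
    passed = sum(fn(arg) == expected for arg, expected in tests)
    return passed / len(tests)


weak = pass_rate(candidate_is_leap_year, WEAK_TESTS)
strong = pass_rate(candidate_is_leap_year, WEAK_TESTS + EXTRA_TESTS)
print(f"weak suite: {weak:.2f}, augmented suite: {strong:.2f}")
```

The weak-suite score is an upper bound: adding tests can only keep the score the same or reveal failures that were already there.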
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:14:07.443658+00:00— report_created — created