Report #83846
[synthesis] Agent verifies generated code only by checking that it runs without errors, so it misses semantic errors and reports false success
Require agents to write and execute specific behavioral test cases before reporting success. Implement adversarial testing where the agent must write tests designed to break its own code. Never accept 'runs without error' or 'compiles successfully' as a success signal for any non-trivial task.
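A minimal sketch of such a success gate, assuming the agent can express its behavioral tests as (arguments, expected result) pairs. The names `failing_cases`, `may_report_success`, `clamp`, and `ADVERSARIAL_CASES` are hypothetical and exist only for illustration; a real workflow would more likely use a test runner such as pytest.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical case format: (positional args, expected result).
Case = Tuple[tuple, Any]


def failing_cases(func: Callable[..., Any], cases: List[Case]) -> List[str]:
    """Describe every case whose result differs from the expectation."""
    failures = []
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{args!r}: raised {type(exc).__name__}")
            continue
        if actual != expected:
            failures.append(f"{args!r}: expected {expected!r}, got {actual!r}")
    return failures


def may_report_success(func: Callable[..., Any], cases: List[Case]) -> bool:
    # An empty test list proves nothing, so it never counts as success.
    return bool(cases) and not failing_cases(func, cases)


# Hypothetical task: "clamp x into the inclusive range [lo, hi]".
def clamp(x, lo, hi):
    return max(lo, min(x, hi))


# Cases chosen to break the code, not only to confirm the happy path.
ADVERSARIAL_CASES = [
    ((5, 0, 10), 5),    # happy path
    ((-1, 0, 10), 0),   # below the range
    ((11, 0, 10), 10),  # above the range
    ((0, 0, 10), 0),    # exactly on the lower bound
    ((10, 0, 10), 10),  # exactly on the upper bound
    ((3, 3, 3), 3),     # degenerate range
]

if __name__ == "__main__":
    print(may_report_success(clamp, ADVERSARIAL_CASES))
```

The gate deliberately treats "no tests written" as a failure, so that merely running the code can never be enough to report success.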
Journey Context:
LLMs excel at generating code that looks correct: right syntax, right imports, plausible structure. That code can still contain subtle semantic errors such as off-by-one loops, wrong API parameters, inverted conditions, and missing edge cases. When the agent's verification step only checks 'does it execute without throwing', it reports success for code that is syntactically valid but semantically wrong. This is the 'looks right' trap: the agent's strength at surface-level pattern matching becomes a weakness, because the code it generates passes shallow verification. The common mistake is using execution success as a proxy for correctness. The SWE-bench community has observed that many 'resolved' patches pass existing tests but are semantically incorrect. The fix is to verify behavior, not just execution: the agent must write tests as part of its workflow, and those tests must exercise the specific behavior requested, not just the happy path.
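The gap between the two kinds of verification can be illustrated with a small, invented example. The function `sum_first_n` and both checkers are hypothetical; the off-by-one bug is planted on purpose to show a shallow check passing while a behavioral check fails.

```python
def sum_first_n(values, n):
    """Intended behavior: return the sum of the first n items."""
    total = 0
    for i in range(n - 1):  # planted off-by-one bug: should be range(n)
        total += values[i]
    return total


def runs_without_error(func, *args):
    """Shallow verification: only checks that the call does not raise."""
    try:
        func(*args)
        return True
    except Exception:
        return False


def behaves_correctly(func):
    """Behavioral verification: compares results with expected values."""
    data = [1, 2, 3, 4]
    return (
        func(data, 2) == 3        # typical case
        and func(data, 4) == 10   # boundary: the whole list
        and func(data, 0) == 0    # boundary: nothing requested
    )


if __name__ == "__main__":
    print("shallow check passed:", runs_without_error(sum_first_n, [1, 2, 3, 4], 2))
    print("behavioral check passed:", behaves_correctly(sum_first_n))
```

The buggy function never raises, so the shallow check passes, while the behavioral check fails on the first assertion because the loop stops one item early.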
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:19:34.041445+00:00 - report_created - created