Report #2215
[research] Generated code looks correct but fails at import, compile, or runtime due to hallucinated symbols
Run generated code in a sandbox against tests, or at minimum pass it through static analysis, and treat the result as the ground-truth check. Treat successful execution as the minimum bar before returning a solution; when a run fails, feed the error back and iterate rather than explaining around it.
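The run-then-iterate loop can be sketched as follows. This is a minimal illustration, not a hardened sandbox: a subprocess with a timeout does not isolate untrusted code, so real isolation (container, VM, restricted user) is assumed to be layered on top. The names run_candidate and solve, and the generate callable standing in for the model call, are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, tests: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run generated code plus its tests in a separate interpreter process.

    Returns (passed, feedback), where feedback is the captured stderr.
    NOTE: a subprocess with a timeout is not a security boundary.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout} seconds"
    return proc.returncode == 0, proc.stderr.strip()


def solve(generate, tests: str, max_rounds: int = 3):
    """Ask for code, run it, and feed failures back until the tests pass.

    `generate` is a hypothetical callable: it takes the previous error text
    (empty on the first round) and returns a new candidate as a string.
    """
    feedback = ""
    for _ in range(max_rounds):
        code = generate(feedback)
        passed, feedback = run_candidate(code, tests)
        if passed:
            return code
    return None  # no candidate cleared the minimum bar
```

A candidate that calls a nonexistent function such as math.sqroot fails here with an AttributeError in the feedback string, which is exactly the text to hand back to the generator on the next round.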
Journey Context:
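Chen et al.'s HumanEval and Austin et al.'s MBPP score generated code by running it against unit tests (pass@k) rather than by how it reads, and their results show a large gap between plausible-looking code and code that actually runs. FacTool likewise relies on execution when fact-checking code generation. The trap is trusting code because it is well formatted and syntax-highlighted. The robust pattern is to run or type-check the code and surface any failure as feedback for the next attempt. This matters most when chaining library calls, where a single hallucinated function, argument, or import breaks the whole chain. The trade-off is the cost of setting up a sandbox, which is worth paying because unexecuted code cannot be relied on.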
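When full execution is too costly, a cheap static pass can still catch one class of hallucinated symbols before anything runs. The sketch below, using only the standard library, parses the code and checks that every absolute import resolves in the current environment. The function names are hypothetical, and the check is deliberately narrow: it covers imports only, it imports the referenced modules to inspect them (which runs their top-level code), and it will not notice a bad attribute access such as math.sqroot; a type checker or an actual run is still needed for that.

```python
import ast
import importlib
import importlib.util


def _module_exists(name: str) -> bool:
    """Return True if `name` can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or is not a package.
        return False


def find_unresolved_imports(code: str) -> list[str]:
    """List imports in `code` that do not resolve in this environment."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if not _module_exists(alias.name):
                    problems.append(f"no module named {alias.name!r}")
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if not _module_exists(node.module):
                problems.append(f"no module named {node.module!r}")
                continue
            module = importlib.import_module(node.module)
            for alias in node.names:
                if alias.name == "*":
                    continue
                is_attr = hasattr(module, alias.name)
                is_submodule = _module_exists(f"{node.module}.{alias.name}")
                if not (is_attr or is_submodule):
                    problems.append(
                        f"module {node.module!r} has no name {alias.name!r}"
                    )
    return problems
```

An empty list means every import resolved; any entries can be returned to the generator as feedback in the same way as a runtime error.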
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:08:39.976226+00:00— report_created — created