Report #29355
[synthesis] Generated code looks correct but has runtime errors or doesn't solve the actual problem
Always execute generated code in a sandboxed environment and use the execution results (exit code, stdout, stderr) as feedback. Never trust the LLM's self-assessment of code correctness.
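A minimal sketch of that feedback loop's execution step, in Python. The names `ExecutionResult` and `run_generated_code` are hypothetical, chosen only for illustration. The child process here is a stand-in for a real sandbox and provides no isolation by itself, so this function is assumed to run inside an ephemeral container or VM.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecutionResult:
    """Outcome of one run, kept as separate fields for the feedback prompt."""
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool = False


def _as_text(data) -> str:
    # On timeout, partial output may be None or bytes depending on platform.
    if data is None:
        return ""
    return data.decode("utf-8", errors="replace") if isinstance(data, bytes) else data


def run_generated_code(source: str, timeout_s: float = 10.0) -> ExecutionResult:
    """Run `source` in a child Python process and capture its outcome.

    ASSUMPTION: the caller is already inside an isolated, disposable
    environment. A plain subprocess is NOT a sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,  # stdout and stderr are captured separately
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired as exc:
            # -1 is an arbitrary marker chosen for this sketch.
            return ExecutionResult(-1, _as_text(exc.stdout), _as_text(exc.stderr), timed_out=True)
        return ExecutionResult(proc.returncode, proc.stdout, proc.stderr)
```

The returned fields, rather than the model's own judgment, are what would be passed back to the LLM on the next iteration. A timeout is reported as its own flag so it is not confused with an ordinary nonzero exit.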
Journey Context:
LLMs are unreliable judges of their own output: they suffer from sycophancy and confirmation bias. Devin's architecture makes execution verification central: it runs code, reads error messages, and iterates. E2B provides infrastructure specifically for this pattern. The insight is that execution is ground truth. A test passing or failing is an objective signal, unlike the LLM's opinion. The tradeoff is latency (execution takes time) and infrastructure complexity (sandboxed environments), but without verification, agents produce plausible but broken code. Critical detail: capture stderr separately from stdout and include exit codes. Structured error messages (file, line, message) are more useful to the LLM than raw stack traces. The sandbox must be ephemeral and isolated. Never execute untrusted generated code on the host system.
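One way to produce the structured (file, line, message) form is to reduce captured stderr before handing it back to the model. The sketch below assumes the generated code is Python and that stderr holds a standard CPython traceback. `StructuredError` and `summarize_traceback` are hypothetical names, and chained exceptions or non-Python output are not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches traceback frame lines such as:  File "generated.py", line 3, in main
_FRAME_RE = re.compile(r'^\s*File "(?P<file>.+?)", line (?P<line>\d+)')


@dataclass
class StructuredError:
    file: Optional[str]
    line: Optional[int]
    message: str


def summarize_traceback(stderr: str) -> Optional[StructuredError]:
    """Reduce a Python traceback to the innermost frame and the final message."""
    lines = [ln for ln in stderr.splitlines() if ln.strip()]
    if not lines:
        return None
    file: Optional[str] = None
    line_no: Optional[int] = None
    for ln in lines:
        match = _FRAME_RE.match(ln)
        if match:
            # Later frames overwrite earlier ones, so the innermost frame wins.
            file, line_no = match.group("file"), int(match.group("line"))
    # The last non-empty line of a traceback carries "ExceptionType: message".
    return StructuredError(file, line_no, lines[-1].strip())
```

If no frame line is found, `file` and `line` stay `None` and only the last stderr line is reported, so unexpected output still reaches the model in some form.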
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:39:53.946795+00:00— report_created — created