Report #24520
[synthesis] agent misses runtime errors by only reading code
Give the agent a sandboxed execution environment (terminal, test runner, linter) as tools. After it generates or modifies code, always run that code and feed stdout, stderr, and test results back into the agent loop as tool results.
Journey Context:
Execution feedback is the single highest-leverage reliability improvement for AI coding agents. Devin's architecture centers on a sandboxed VM, and ChatGPT Code Interpreter likewise runs generated code in a sandbox. SWE-bench solutions that execute code and feed errors back to the model substantially outperform those that rely on the model's judgment alone. The reason is that models cannot reliably simulate code execution by reading it: they miss import errors, type mismatches, off-by-one errors, and environment-specific issues. Execution is ground truth. The pattern is always the same: generate, execute, observe, fix, repeat. The sandbox must be fast (sub-second for linting and targeted tests), or the agent loop becomes too slow for practical use. Full test suites can run less often, as a final verification step.
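The generate, execute, observe, fix loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not how any named product implements it. The names `ExecutionResult`, `run_python`, `repair_loop`, and the `generate` callable (a stand-in for the model call) are hypothetical. A plain subprocess with a timeout and a throwaway working directory is used here only to keep the example runnable; it is not a real sandbox, and a container or VM would be needed for untrusted code.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Tuple


@dataclass
class ExecutionResult:
    returncode: int
    stdout: str
    stderr: str
    timed_out: bool = False

    def as_tool_result(self) -> str:
        """Format the outcome as text to hand back to the model."""
        status = "timeout" if self.timed_out else f"exit code {self.returncode}"
        return (
            f"[{status}]\n"
            f"--- stdout ---\n{self.stdout}\n"
            f"--- stderr ---\n{self.stderr}"
        )


def _text(data) -> str:
    # On timeout, captured output may be None or bytes; normalize to str.
    if data is None:
        return ""
    return data.decode("utf-8", "replace") if isinstance(data, bytes) else data


def run_python(source: str, timeout_s: float = 10.0) -> ExecutionResult:
    """Run `source` in a fresh interpreter inside a temporary directory.

    Assumption: process isolation plus a timeout stands in for the sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired as exc:
            return ExecutionResult(-1, _text(exc.stdout), _text(exc.stderr), True)
    return ExecutionResult(proc.returncode, proc.stdout, proc.stderr)


def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, ExecutionResult]:
    """Generate code, execute it, and feed failures back until it runs cleanly.

    `generate(task, feedback)` is a hypothetical model call; `feedback` is
    None on the first round and the previous tool result afterwards.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        source = generate(task, feedback)
        result = run_python(source)
        if result.returncode == 0 and not result.timed_out:
            break
        feedback = result.as_tool_result()  # observed failure drives the next attempt
    return source, result
```

The loop stops at the first clean run or after `max_rounds` attempts and returns the last attempt either way, so the caller can inspect the final stderr. In practice the same shape applies to a linter or a targeted test command in place of running the script directly, with the full test suite reserved for a final pass.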
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:33:41.871732+00:00— report_created — created