Report #93480
[synthesis] How to architect AI coding agents that actually produce working code — generate-only vs. generate-verify-iterate
Treat the verification loop as a first-class architectural component, on par with the generation step. Architecture: (1) generate the code change, (2) apply the change in a sandboxed environment, (3) run verification (lint, type-check, tests, build), (4) parse the verification output, (5) on failure, feed the structured error output back to the model for a fix iteration. Budget 3-5 iteration cycles. The verification output must be structured (file, line, error message), not raw terminal output.
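The five steps above can be sketched as a small control loop. This is a minimal illustration, not any particular agent's implementation: `generate` and `verify` are hypothetical callables standing in for the model call and the sandboxed apply-and-check step, and the `Diagnostic` type and the default budget of 4 cycles (within the 3-5 range) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Diagnostic:
    """One structured verification failure: file, line, error message."""
    file: str
    line: int
    message: str


def generate_verify_iterate(
    task: str,
    generate: Callable[[str, List[Diagnostic]], str],  # hypothetical model call
    verify: Callable[[str], List[Diagnostic]],         # hypothetical sandbox check
    max_iterations: int = 4,
) -> Optional[str]:
    """Return the first change that verifies cleanly, or None if the budget runs out."""
    feedback: List[Diagnostic] = []
    for _ in range(max_iterations):
        # Step 1: generate a change, given any failures from the previous cycle.
        change = generate(task, feedback)
        # Steps 2-4: apply in the sandbox, run checks, parse output into diagnostics.
        feedback = verify(change)
        if not feedback:
            return change  # verification passed
        # Step 5: loop again with the structured failures as feedback.
    return None
```

Returning `None` when the budget is exhausted keeps the failure explicit, so the caller can escalate to a human or report the last diagnostics instead of shipping unverified code.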
Journey Context:
Generate-only architectures (one-shot code generation) have high failure rates because LLMs frequently produce code with syntax errors, type errors, or test failures, and nothing in the pipeline catches them. The synthesis across SWE-Agent (Princeton), Devin (Cognition), and Cursor's terminal integration shows that the verification loop is not an optimization; it is the core architecture. SWE-Agent's key innovation was giving the agent a custom command set that includes running tests and reading error output. Devin's demo showed it running its own terminal. Cursor's agent mode runs terminal commands and feeds the output back to the model. The critical detail: the verification output must be structured and truncated. Raw terminal output (often thousands of lines) overwhelms the context window. The pattern is: capture the exit code + the first N lines of stderr + the specific error lines. This is what makes iteration work.
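A minimal sketch of that capture pattern follows. The function names, the default of 20 head lines, and the regular expression are assumptions for illustration; the regex targets the common `file:line: message` and `file:line:col: message` shapes emitted by many linters and compilers, and would need adjusting for tools that report errors differently (Python tracebacks, Windows drive-letter paths, and so on).

```python
import re
import subprocess
from typing import Dict, List

# Heuristic: matches "path:LINE: message" and "path:LINE:COL: message".
ERROR_LINE = re.compile(
    r"^(?P<file>[^\s:]+):(?P<line>\d+):(?:\d+:)?\s*(?P<message>.+)$"
)


def summarize(exit_code: int, stderr: str, head_lines: int = 20) -> Dict[str, object]:
    """Reduce raw stderr to exit code + first N lines + structured error lines."""
    lines = stderr.splitlines()
    errors = []
    for text in lines:
        match = ERROR_LINE.match(text.strip())
        if match:
            errors.append({
                "file": match["file"],
                "line": int(match["line"]),
                "message": match["message"],
            })
    return {
        "exit_code": exit_code,
        "stderr_head": lines[:head_lines],       # truncated view of the raw output
        "errors": errors,                        # structured (file, line, message)
        "truncated": len(lines) > head_lines,    # tell the model output was cut
    }


def run_check(command: List[str], head_lines: int = 20, timeout: float = 300.0):
    """Run one verification command and return its summarized result."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return summarize(proc.returncode, proc.stderr, head_lines)
```

Error lines are extracted from the full stderr rather than only the truncated head, so a failure reported late in a long log still reaches the model. Some tools write diagnostics to stdout instead of stderr; in that case the same `summarize` function could be applied to `proc.stdout`.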
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:29:38.772392+00:00— report_created — created