Report #92359
[counterintuitive] Model writes buggy code — needs to understand execution flow better via prompting
Always execute generated code and feed the results (including errors, stack traces, and output) back to the model for correction. Don't assume the model can predict what its own code will do when run. The generate-execute-feedback loop is the reliable pattern; single-shot generation is not.
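A minimal sketch of the generate-execute-feedback loop in Python. `ask_model` is a hypothetical callable standing in for whatever model client you use (prompt string in, script text out); `run_code`, `generate_execute_feedback`, and the prompt wording are illustrative assumptions, not any library's API. Each candidate runs in a separate interpreter via `subprocess.run` with a timeout, which guards against hangs but is not a security sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunResult:
    ok: bool
    stdout: str
    stderr: str


def run_code(code: str, timeout: float = 10.0) -> RunResult:
    """Execute candidate code in a fresh Python process and capture what happened."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return RunResult(ok=False, stdout="", stderr=f"Timed out after {timeout} s")
    return RunResult(ok=proc.returncode == 0, stdout=proc.stdout, stderr=proc.stderr)


def generate_execute_feedback(
    task: str,
    ask_model: Callable[[str], str],  # hypothetical model client
    max_rounds: int = 3,
) -> Tuple[str, RunResult]:
    """Ask for code, run it, and send real execution results back until it runs cleanly."""
    code = ask_model(f"Write a Python script that does the following:\n{task}")
    result = run_code(code)
    for _ in range(max_rounds - 1):
        if result.ok:
            break
        # Feed back the code together with its actual output and traceback.
        prompt = (
            f"This code was executed:\n{code}\n\n"
            f"stdout:\n{result.stdout}\n"
            f"stderr (including any traceback):\n{result.stderr}\n"
            "Fix the code so it runs correctly. Return only the corrected script."
        )
        code = ask_model(prompt)
        result = run_code(code)
    return code, result


# Hypothetical stand-in for a real model: first answer has an off-by-one bug.
replies = iter([
    "nums = [1, 2, 3]\nprint(nums[len(nums)])",      # raises IndexError
    "nums = [1, 2, 3]\nprint(nums[len(nums) - 1])",  # corrected
])


def fake_model(prompt: str) -> str:
    return next(replies)


code, result = generate_execute_feedback("Print the last element of [1, 2, 3].", fake_model)
print(result.ok, result.stdout.strip())  # True 3
```

The fake model's first script looks plausible but raises `IndexError`; the loop only accepts the second answer because it actually ran without error. In practice "ran without error" is a weak check, so it is worth running tests or comparing output as part of `run_code`.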
Journey Context:
Developers expect models to 'understand' what their generated code will do when executed, and are surprised when syntactically correct code has subtle logical bugs. But the model generates code by predicting likely token sequences from patterns in its training data, not by mentally executing it. It cannot run the code in its head; it predicts which code tokens should come next. This differs from how human programmers typically work: they mentally simulate execution as they write, checking edge cases and variable states. The model has no execution engine, so it can produce code that looks correct on the surface but fails at runtime because of off-by-one errors, type mismatches, incorrect API usage, or inverted conditions.
This is why the generate-execute-feedback pattern is essential: the model is good at fixing code when given error messages (code paired with an error message is a common pattern in training data), but bad at predicting execution outcomes without actually running the code.
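To make "looks correct, fails at runtime" concrete, here is a hypothetical model output with a subtle bug: `xs[-n:]` reads as "the last n elements", but for `n == 0` it returns the whole list, because `-0 == 0`. Only running it against concrete cases exposes this. The sketch below (the names `last_n`, `check_generated`, and the test cases are illustrative assumptions) executes the code and, on failure, pairs the source with the real traceback, which is the code-plus-error shape the model is good at repairing. It uses `exec` in-process for brevity; untrusted generated code belongs in an isolated process or sandbox.

```python
import traceback
from typing import Optional

# Hypothetical model output: reads as "last n elements", but xs[-0:] is the whole list.
GENERATED = '''
def last_n(xs, n):
    """Return the last n elements of xs."""
    return xs[-n:]
'''

# Illustrative test cases: (arguments, expected result).
CASES = [
    (([1, 2, 3], 2), [2, 3]),
    (([1, 2, 3], 0), []),
]


def check_generated(source: str) -> Optional[str]:
    """Run generated source against CASES; return code-plus-traceback feedback on failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # in-process for brevity; isolate untrusted code in practice
        last_n = namespace["last_n"]
        for args, expected in CASES:
            got = last_n(*args)
            if got != expected:
                raise AssertionError(
                    f"last_n{args!r} returned {got!r}, expected {expected!r}"
                )
    except Exception:
        # Pair the code with what actually happened when it ran.
        return f"This code:\n{source}\nfailed when executed:\n{traceback.format_exc()}"
    return None


feedback = check_generated(GENERATED)
print(feedback)
# last line: AssertionError: last_n([1, 2, 3], 0) returned [1, 2, 3], expected []
```

The `n == 2` case passes, so a single happy-path check would have missed the bug; the returned `feedback` string is what would be sent back to the model for the next attempt.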
Lifecycle
2026-06-22T13:36:51.675345+00:00— report_created — created