Report #81558
[counterintuitive] Can LLMs reliably trace through code execution and predict runtime behavior?
Never rely on an LLM to mentally execute code for correctness verification. Always use actual code execution (interpreter, test runner) to verify behavior. Use the LLM to write and reason about code, but validate with real execution. Treat model-generated traces as hypotheses, not ground truth.
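A minimal sketch of "treat the trace as a hypothesis", assuming the code under test is a small Python snippet that prints its result. The helper names run_snippet and check_prediction, the example snippet, and the "predicted" output are all hypothetical and exist only for illustration; a real setup would more likely use a test runner and a sandbox.

```python
import subprocess
import sys


def run_snippet(source: str, timeout: float = 5.0) -> str:
    """Run `source` in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the snippet itself fails
    )
    return result.stdout


def check_prediction(source: str, predicted_stdout: str) -> bool:
    """Compare a predicted output (the hypothesis) with the real output."""
    actual = run_snippet(source)
    return actual.strip() == predicted_stdout.strip()


# Hypothetical snippet with shared mutable state (a mutable default argument).
SNIPPET = """
def append_item(item, bucket=[]):
    bucket.append(item)
    return bucket

append_item(1)
print(append_item(2))
"""

# A plausible-looking but wrong trace: it assumes a fresh list on every call.
plausible_trace = "[2]"

print(check_prediction(SNIPPET, plausible_trace))  # False: real output is [1, 2]
```

The point of the design is that the predicted output never decides anything on its own: only the comparison against the interpreter's actual output does.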
Journey Context:
LLMs can often predict what simple code does because they have seen similar patterns in training data. But this is pattern matching, not execution. For any non-trivial code, especially code with mutable state, loops, recursion, or complex data structures, the model's 'mental execution' diverges from actual execution because: (1) it must predict each state transition autoregressively, so errors compound at every step; (2) it has no mutable memory, so it must reconstruct the full program state from context at each step, and any single error corrupts all subsequent state; (3) it cannot maintain a call stack or heap, so these must be simulated in text, and the simulation drifts; (4) in loops, it must track the same variables as they change across many iterations, and accuracy degrades with each iteration. This can look like 'the model doesn't understand the code', but it is more precise to say that understanding ≠ execution. The model may understand what the code should do and still be unable to reliably simulate what it actually does.
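The compounding effect in point (1) can be illustrated with a back-of-the-envelope model. This is a simplifying assumption, not a measured result: it treats every predicted state transition as independently correct with the same probability, and the 0.99 per-step figure is a made-up value chosen only to show the shape of the decay. The function name full_trace_accuracy is hypothetical.

```python
def full_trace_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a trace is right, assuming each step
    is independently correct with probability `per_step_accuracy`."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Illustrative only: 0.99 is an assumed per-step accuracy, not a benchmark.
for steps in (10, 100, 1000):
    probability = full_trace_accuracy(0.99, steps)
    print(f"{steps:>5} steps -> {probability:.4f}")
```

Under these assumptions, a trace that is 99% reliable per step is fully correct about 90% of the time over 10 steps, about 37% of the time over 100 steps, and almost never over 1000 steps, which is why a loop with many iterations is a poor candidate for mental execution.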
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:29:17.419958+00:00— report_created — created