Report #82330
[counterintuitive] Why does the model get simple arithmetic wrong inconsistently, even with chain-of-thought?
Always delegate arithmetic, precise numerical computation, and any calculation requiring algorithmic correctness to code execution tools. Never rely on the model's native numerical output for correctness-critical math, regardless of how simple it looks.
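A minimal sketch of what such a delegation target could look like, assuming the model emits a plain arithmetic expression as a string and the host application evaluates it. The function name `evaluate_arithmetic` and the operator whitelist are hypothetical and not part of any specific tool API; the sketch walks the parsed expression instead of calling `eval()` so that only arithmetic is executed.

```python
import ast
import operator

# Hypothetical whitelist: only these operators are ever executed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_arithmetic(expression: str):
    """Evaluate a plain arithmetic expression without using eval().

    Integer results are exact because Python ints have arbitrary precision.
    Note: this sketch does not bound exponent size, so a production tool
    would also need a limit on inputs such as 9**9**9.
    """

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int and float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return _eval(ast.parse(expression, mode="eval"))


print(evaluate_arithmetic("347 * 892"))
print(evaluate_arithmetic("347 * 893"))
```

Anything outside the whitelist (function calls, names, attribute access) raises `ValueError`, so the model's output is treated as data to compute on rather than as code to trust.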
Journey Context:
LLMs learn arithmetic as statistical pattern matching over token sequences, not as algorithmic computation. They memorize common facts (2+2=4, 12×12=144) but can fail on novel computations because they do not execute a carry algorithm; they predict the next token from surface patterns. Chain-of-thought helps by decomposing a problem into smaller sub-problems that are more likely to have been memorized, but it does not give the model an algorithm, so errors remain inconsistent. For example, the model might correctly compute 347×892 but fail on 347×893 because the latter is further out of distribution. This unreliability is the gap that tools such as OpenAI's Code Interpreter address by running the calculation as code. The error pattern is distinctive: the model produces answers that look plausible (right number of digits, reasonable magnitude) but are wrong in specific digits, which is what you would expect from pattern completion without algorithmic execution.
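The "plausible but wrong in specific digits" pattern can be checked mechanically. The sketch below is illustrative only: `check_claimed_product` is a hypothetical helper, and the claimed value 309424 is a made-up example of a wrong answer, not a recorded model output.

```python
def check_claimed_product(a: int, b: int, claimed: int) -> dict:
    """Compare a claimed product against the exact one, digit by digit."""
    exact = a * b  # Python ints are arbitrary precision, so this is exact
    exact_s, claimed_s = str(exact), str(claimed)
    same_length = len(exact_s) == len(claimed_s)
    # Digit positions are 0-based from the left; only meaningful when
    # both numbers have the same number of digits.
    wrong_positions = (
        [i for i, (x, y) in enumerate(zip(exact_s, claimed_s)) if x != y]
        if same_length
        else None
    )
    return {
        "exact": exact,
        "correct": claimed == exact,
        "same_digit_count": same_length,
        "wrong_digit_positions": wrong_positions,
    }


# Made-up claimed answer: right length and magnitude, one digit off.
print(check_claimed_product(347, 892, 309424))
```

A result with `same_digit_count` true and a short `wrong_digit_positions` list matches the failure signature described above: the answer passes a glance check but fails exact comparison.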
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:47:10.477380+00:00— report_created — created