Report #83453
[counterintuitive] Why does the model get arithmetic wrong even with chain-of-thought prompting?
Use code execution or calculator tools for any arithmetic beyond trivial single-digit operations. Treat model-generated calculations as unverified estimates, never as guaranteed-correct values in correctness-critical paths.
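One way to act on this recommendation is to hand arithmetic to a small calculator tool instead of letting the model produce the digits. The sketch below is a minimal illustration, not a reference implementation: the function name calculate and the choice of supported operators are assumptions made for this example. It uses Python's standard ast module so that only plain arithmetic is evaluated and eval() is never called.

```python
import ast
import operator

# Whitelisted operators; anything else is rejected. Exponentiation is left out
# on purpose so a huge power cannot be used to stall the tool.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression (hypothetical tool entry point)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int/float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("347 * 8912"))  # exact integer product, computed by Python
```

Function calls, names, and attribute access all raise ValueError, so the model can only pass arithmetic to the tool. The value returned by the tool, not the model's own text, is what should flow into any correctness-critical path.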
Journey Context:
The common belief is that chain-of-thought (CoT) prompting fixes arithmetic errors by forcing the model to show its work. CoT does improve performance on math word problems, but not because it enables computation: it decomposes a problem into smaller steps that are more likely to fall within the model's memorized arithmetic patterns. LLMs have no arithmetic logic unit; they perform math by pattern matching against training data. This works reliably for common calculations (7×8=56) but fails for less common ones (347×8912) regardless of model size. Larger models memorize more arithmetic facts, which creates an illusion of computational ability, but there is no reliability threshold, because the model cannot distinguish a calculation it has memorized from one it is approximating. The fundamental issue is that text generation is not computation, and no prompt technique converts a pattern matcher into a calculator.
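Because the model cannot flag which of its own steps are memorized and which are approximated, one option is to recompute every arithmetic claim in its output. The sketch below is a rough illustration under stated assumptions: the helper find_wrong_products and the regular expression are invented for this example, it only recognizes integer claims of the form "a × b = c", and the sample text is a made-up model output, not a recorded one.

```python
import re

# Matches integer claims such as "347 × 8912 = 3,091,464"; accepts ×, x, or *.
_CLAIM = re.compile(r"(\d[\d,]*)\s*[×x*]\s*(\d[\d,]*)\s*=\s*(\d[\d,]*)")


def _to_int(digits: str) -> int:
    """Drop thousands separators and convert to int."""
    return int(digits.replace(",", ""))


def find_wrong_products(text: str):
    """Return (a, b, claimed, exact) for each multiplication claim that is wrong."""
    wrong = []
    for match in _CLAIM.finditer(text):
        a, b, claimed = (_to_int(group) for group in match.groups())
        exact = a * b  # computed by the interpreter, not recalled from text
        if claimed != exact:
            wrong.append((a, b, claimed, exact))
    return wrong


# Hypothetical chain-of-thought text: the first claim is right, the second is
# off by one digit, which is easy to miss by eye.
sample = "First, 7×8=56. Then 347 × 8912 = 3,091,464."
print(find_wrong_products(sample))
```

A check like this catches only the claims it can parse. It is a safety net for reviewing model output, not a substitute for routing the calculation through a tool in the first place.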
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:39:41.165129+00:00— report_created — created