Report #40461
[counterintuitive] Why does the model fail at multi-digit arithmetic even with chain-of-thought prompting?
Always delegate arithmetic and numerical computation to code execution tools (Python interpreter, calculator function). Never trust model-generated arithmetic for anything beyond simple single-digit operations, regardless of prompting strategy or model size.
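A minimal sketch of what such a calculator function could look like, assuming the model emits a plain arithmetic expression as a string and the host application evaluates it. The function name `calculate` and the allowed operator set are hypothetical choices for illustration, not part of any particular tool-calling framework. The expression is parsed with Python's `ast` module rather than passed to `eval()`, so that only numeric literals and basic operators are accepted.

```python
import ast
import operator

# Hypothetical tool: the name `calculate` and this operator whitelist are
# illustrative assumptions, not taken from any specific framework.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int and float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, etc. are refused outright.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # The model supplies the expression; the interpreter supplies the digits.
    print(calculate("48271 * 90653"))
```

In this arrangement the model's job is reduced to deciding which expression to evaluate, and the result comes from the interpreter's integer arithmetic rather than from token prediction.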
Journey Context:
The common belief is that chain-of-thought prompting or larger models will eventually make arithmetic reliable. The reality: LLMs do not compute arithmetic; they approximate it by pattern matching against training data. Multi-digit multiplication requires a specific algorithmic procedure (carry, align, sum partial products) that the model can describe but not reliably execute. Each digit operation is a separate next-token prediction, and errors compound across steps. The model has no register, no dedicated working memory for carries, and no algorithmic state machine; it predicts the most likely next token given all previous tokens. This is a fundamental mismatch between autoregressive token prediction and algorithmic computation. No amount of prompting creates a calculator; you must call one.
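To make the procedure named above concrete, here is a rough sketch of schoolbook multiplication with the carry, the digit alignment, and the running sum of partial products held as explicit program state. The function name `schoolbook_multiply` is hypothetical, and the sketch illustrates the algorithm only; it makes no claim about what happens inside a model.

```python
def schoolbook_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers digit by digit with explicit carries."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # Least significant digit first, so list index equals the power of ten.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]

    # Running sum of all partial products, one slot per output digit.
    result = [0] * (len(a_digits) + len(b_digits))

    for i, da in enumerate(a_digits):
        carry = 0  # state that must survive across every inner step
        for j, db in enumerate(b_digits):
            # Alignment: digit i times digit j lands in position i + j.
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        # Whatever carry remains spills into the next higher position.
        result[i + len(b_digits)] += carry

    return int("".join(str(d) for d in reversed(result)))


if __name__ == "__main__":
    x, y = 48271, 90653
    # Cross-check the hand-rolled procedure against native integer arithmetic.
    print(schoolbook_multiply(x, y) == x * y)
```

Two five-digit operands already require 25 digit-level updates, each depending on the exact carry and partial sum left by the previous one. A program keeps that state in variables; a model would have to reproduce every update correctly through successive token predictions, and a single wrong digit changes the final answer.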
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:23:07.321096+00:00— report_created — created