Report #81551
[counterintuitive] Why does the model get arithmetic wrong even with chain-of-thought prompting?
Never trust LLM-generated numbers for any operation requiring precision. Always delegate arithmetic, numerical computation, and quantitative reasoning to a code interpreter, calculator tool, or external API. Use the LLM to decide WHAT to compute, not to compute it.
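A minimal sketch of that split, assuming the model has been prompted to emit a plain arithmetic expression as a string instead of a final number. The helper name `evaluate_arithmetic` and the set of allowed operators are illustrative choices, not part of any particular tool-calling API. The expression is parsed with Python's standard `ast` module and evaluated deterministically, so the model only decides what to compute.

```python
import ast
import operator

# Hypothetical whitelist of operators the tool is willing to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_arithmetic(expression: str):
    """Deterministically evaluate an arithmetic expression produced by the model.

    Anything other than numbers and the whitelisted operators is rejected,
    so model output is never passed to eval() or executed as code.
    """
    def _eval(node):
        # Numeric literals only (bool is deliberately excluded).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


# The model proposes the expression; the tool produces the number.
print(evaluate_arithmetic("847 * 392"))  # 332024
```

Walking the syntax tree rather than calling `eval()` is a deliberate choice here: the expression comes from model output, so it should be treated as untrusted input.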
Journey Context:
Autoregressive next-token prediction is the wrong computational architecture for precise arithmetic. The model doesn't compute 847 × 392; it predicts the most likely digit sequence given patterns in its training data. Even with chain-of-thought, each digit is a separate probabilistic prediction, and the chance that every digit and every intermediate step comes out right shrinks multiplicatively as the computation gets longer. This is not a training gap that more data or better prompting will close. It is an architectural mismatch between probabilistic text generation and deterministic computation. Larger models get better at approximating common arithmetic patterns but cannot match the reliability of a calculator, because they are predicting rather than computing. The paper that introduced the GSM8K benchmark (Cobbe et al., 2021) documented this gap and addressed it with trained verifiers and calculator annotations rather than with better prompting.
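To make the compounding concrete, here is a rough back-of-the-envelope sketch. It assumes every generated token is correct independently with the same probability, which is a simplification and not a measured property of any model. The function name and the 99% figure are illustrative only.

```python
def exact_sequence_probability(per_token_accuracy: float, num_tokens: int) -> float:
    """Probability that all tokens are correct, assuming independent errors.

    Illustrative model only: real token errors are correlated, and the
    accuracy value is a made-up input, not a benchmark result.
    """
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be between 0 and 1")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return per_token_accuracy ** num_tokens


# With a hypothetical 99% per-token accuracy, a 50-token calculation
# comes out fully correct only about 60% of the time.
print(round(exact_sequence_probability(0.99, 50), 3))  # 0.605
```

A calculator or code interpreter avoids this decay entirely, because its result does not depend on a chain of per-token guesses.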
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:29:02.470401+00:00— report_created — created