Report #82838
[counterintuitive] Model gets arithmetic wrong on large numbers or multi-step calculations despite chain-of-thought prompting
Always delegate precise arithmetic, large-number operations, and multi-step mathematical calculations to code execution; use the LLM only for problem formulation and result interpretation, never as a calculator
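A minimal sketch of the delegation pattern, assuming the model is asked to emit an arithmetic expression as text and the host application computes it. The function name `evaluate_arithmetic` and the operator whitelist are hypothetical choices for illustration, not part of any particular framework:

```python
import ast
import operator

# Whitelist of permitted operators; anything else is rejected.
# Exponentiation is deliberately left out so a huge exponent cannot stall the host.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed expression node (hypothetical helper)."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_arithmetic(expression: str):
    """Compute a plain arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


print(evaluate_arithmetic("37 * 89"))  # 3293
```

Python integers are arbitrary precision, so large-number products come out exact. Expressions containing floats still follow ordinary floating-point rules; the `decimal` or `fractions` modules would be the place to look if exact decimal results are needed.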
Journey Context:
When a model says 37 × 89 = 3292 (it's 3293), developers assume it needs more reasoning steps or a better chain-of-thought prompt. But LLMs are not calculators that sometimes make errors; they are token predictors that do not compute at all. When a model outputs '4' after '2+2=', it is doing the same thing as when it outputs 'Paris' after 'The capital of France is': pattern matching against training data, not performing arithmetic. This works for common facts (2+2=4, 10×10=100) but becomes unreliable outside the training distribution: large numbers, unusual operands, and multi-step calculations where small errors compound. Chain-of-thought helps by decomposing a problem into smaller steps that are individually more likely to match training patterns, but it doesn't change the underlying mechanism. Each step is still a prediction, not a computation, and an error in an early step propagates to every later one. Even reasoning-optimized models that show improved math performance are still predicting tokens: they have been trained on more mathematical patterns, not given the ability to compute. For any calculation where precision matters, code execution is the only reliable path.
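Because every step in a chain of thought is a prediction, one inexpensive safeguard is to recompute each stated equation in code before trusting it. The sketch below is one way to do that under narrow assumptions: it only recognises claims of the form `a op b = c` with non-negative integers and the operators `+`, `-`, `*`, `x`, `×`. The name `find_wrong_claims` and the regular expression are hypothetical, chosen for illustration:

```python
import re

# Matches integer claims such as "37 × 89 = 3292" or "2+2=4".
# The lookarounds skip numbers that are part of a decimal like "3.5".
_CLAIM = re.compile(
    r"(?<![\d.])(\d+)\s*([+\-*x×])\s*(\d+)\s*=\s*(\d+)(?!\d|\.\d)"
)


def find_wrong_claims(text: str):
    """Return (claim_text, claimed_value, actual_value) for each incorrect claim."""
    wrong = []
    for match in _CLAIM.finditer(text):
        left, op, right = int(match[1]), match[2], int(match[3])
        claimed = int(match[4])
        if op == "+":
            actual = left + right
        elif op == "-":
            actual = left - right
        else:  # "*", "x", or "×"
            actual = left * right
        if actual != claimed:
            wrong.append((match.group(0), claimed, actual))
    return wrong


print(find_wrong_claims("First, 37 × 89 = 3292."))
# [('37 × 89 = 3292', 3292, 3293)]
```

A checker like this catches only the equations it can parse, so it complements delegating the calculation to code rather than replacing it.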
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:38:17.230337+00:00 - report_created - created