Report #70440
[counterintuitive] Model gives incorrect answers to arithmetic calculations that seem trivially easy
Use code execution or a calculator tool for any arithmetic where exactness matters. Never rely on the model's direct text output for computation, regardless of model size or claimed reasoning capability.
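As an illustration of the calculator-tool approach, here is a minimal sketch of an exact arithmetic evaluator that could sit behind such a tool. The function name `calculate` and the choice of supported operators are assumptions made for this example, not part of any particular tool-calling API. It uses Python's standard `ast` module to accept only plain arithmetic and `fractions.Fraction` to keep results exact.

```python
import ast
import operator
from fractions import Fraction

# Whitelisted operators; anything else (names, calls, powers) is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Going through str() reads 0.1 as exactly 1/10. Limitation:
            # decimal literals beyond float precision are rounded first.
            return Fraction(str(value))
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def calculate(expression: str) -> Fraction:
    """Hypothetical tool entry point: evaluate plain arithmetic exactly."""
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)
```

With this sketch, `calculate("847 * 293")` returns `Fraction(248171, 1)`, and an input such as `__import__('os')` raises `ValueError` instead of being executed. Exponentiation is left out deliberately so that a short expression cannot request an enormous result.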
Journey Context:
The common belief is that larger or 'reasoning' models should handle arithmetic, and that chain-of-thought prompting ('let's calculate step by step') fixes math errors. CoT helps somewhat by decomposing a problem into steps that are more likely to appear in training data, but it doesn't change the fundamental architecture: LLMs are next-token predictors, not calculators. They approximate the statistical distribution of correct answers in their training corpus. For extremely common facts (2+2=4), the statistical signal is overwhelming and answers are reliable. For less common computations (847×293), the model is pattern-matching, not computing: it generates what looks like a plausible answer, not necessarily what is mathematically correct. Error rate grows with number size and operation complexity. No model size eliminates this, because the mechanism itself is mismatched: autoregressive token prediction is not arithmetic computation. This is why a model that can explain calculus can fail at a multiplication that a $1 calculator handles perfectly.
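To make the 847×293 case concrete, the sketch below checks the number stated in a model's text reply against the exact product computed in code. The helper names and the sample reply strings are hypothetical, and the "last integer in the reply" rule is a simple heuristic assumed for this example rather than a robust parser.

```python
import re
from typing import Optional


def extract_final_integer(reply: str) -> Optional[int]:
    """Return the last integer in the text, ignoring thousands separators.

    Heuristic only: it assumes the final number in the reply is the answer.
    """
    matches = re.findall(r"-?\d[\d,]*", reply)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def product_matches(a: int, b: int, reply: str) -> bool:
    """Compare the reply's final number with the exact product a * b."""
    # Python integers have arbitrary precision, so a * b is exact.
    return extract_final_integer(reply) == a * b
```

The exact product is 248171, so `product_matches(847, 293, "847 × 293 = 248,171")` is true, while a reply ending in the plausible-looking `248,271` fails the check even though five of its six digits are correct.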
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:49:10.614212+00:00— report_created — created