Report #59530
[counterintuitive] LLM makes arithmetic errors on large or uncommon numbers despite correct reasoning
Route all arithmetic through code execution or calculator tools. Never trust direct model output for numerical computation, even when the reasoning chain looks correct. Use tool-calling patterns where the model generates the expression and a runtime evaluates it. This applies regardless of model size: frontier models still produce wrong digits for numbers outside their training distribution.
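A minimal sketch of the "model generates the expression, runtime evaluates it" pattern, in Python. The function name evaluate_expression and the choice of exact Fraction arithmetic are assumptions made for illustration, not part of any particular tool-calling API. The model's only job is to emit a string such as "123456789 * 987654321"; the runtime parses it against a small whitelist of operators instead of calling eval.

```python
import ast
import operator
from fractions import Fraction

# Whitelisted operators; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_expression(expression: str) -> Fraction:
    """Evaluate a model-generated arithmetic expression (hypothetical helper).

    Only numeric literals, + - * /, unary signs, and parentheses are allowed.
    Results are exact Fractions. Note that decimal literals pass through
    Python's float parser first, so very long decimals are rounded to
    float precision before conversion.
    """
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a valid expression: {expression!r}") from exc

    def _eval(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            value = node.value
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return Fraction(str(value))
            raise ValueError(f"unsupported literal: {value!r}")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return _eval(tree)


if __name__ == "__main__":
    # The expression string stands in for what the model would emit.
    print(evaluate_expression("123456789 * 987654321"))
    print(evaluate_expression("0.1 + 0.2"))
```

Exponentiation is deliberately left out of the whitelist so that a malformed or adversarial expression cannot trigger an enormous computation; a real deployment would likely also need time and size limits.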
Journey Context:
Seeing GPT-4 solve math competition problems creates a false impression that LLMs can do arithmetic. The model isn't computing; it's pattern-matching. For numbers frequently seen in training (small integers, common constants, round numbers), pattern-matching produces correct results. For arbitrary large numbers, unusual decimals, or multi-step calculations, the model generates plausible-looking but incorrect digits. This is because the model has no arithmetic logic unit; it predicts the next token based on statistical patterns in training text. Chain-of-thought helps by decomposing problems into steps that each match training patterns better, but it doesn't provide computational precision. The error rate compounds with every digit and every computational step. As an illustration, assuming independent errors, a model that gets each digit right 99% of the time produces a fully correct 10-digit result only about 90% of the time (0.99^10), and a chain of three such steps is fully correct only about 74% of the time (0.99^30), which is unacceptable for any precise computation. The model also can't distinguish its correct arithmetic from its incorrect arithmetic, so its expressed confidence is not a reliable signal.
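One simplified way to estimate how errors compound, sketched in Python. It assumes every digit is an independent trial with the same accuracy, which is an idealization rather than a measured property of any model; the function name exact_result_probability is hypothetical.

```python
def exact_result_probability(per_digit_accuracy: float, digits: int, steps: int = 1) -> float:
    """Chance that every digit of every step is correct.

    Assumes independent, identically likely digit errors (a simplification),
    so the probability is per_digit_accuracy ** (digits * steps).
    """
    if not 0.0 <= per_digit_accuracy <= 1.0:
        raise ValueError("per_digit_accuracy must be between 0 and 1")
    if digits < 1 or steps < 1:
        raise ValueError("digits and steps must be positive")
    return per_digit_accuracy ** (digits * steps)


if __name__ == "__main__":
    # 10-digit results, 99% per-digit accuracy, chained over 1 to 3 steps.
    for steps in (1, 2, 3):
        p = exact_result_probability(0.99, digits=10, steps=steps)
        print(f"{steps} step(s): {p:.1%}")
```

Under these assumptions, even a high per-digit accuracy decays quickly as results get longer or calculations get chained, which is why a deterministic runtime is the safer place for the arithmetic.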
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:24:36.261887+00:00— report_created — created