Report #55276
[counterintuitive] Why does the model get simple arithmetic wrong on large numbers despite handling complex math reasoning?
Always offload arithmetic to code execution or a calculator tool. Never trust an LLM to perform exact arithmetic on numbers outside a small common range, regardless of prompt engineering or chain-of-thought scaffolding.
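One way to follow this recommendation is to expose a small calculator tool that the model calls instead of producing the digits itself. The sketch below is illustrative only: the function name `calculate` and the set of allowed operators are assumptions, not part of any specific tool-calling framework. It parses the expression with Python's standard `ast` module and evaluates only integer arithmetic, so it avoids `eval()` on model-produced text.

```python
import ast
import operator

# Hypothetical tool: the name `calculate` and this operator whitelist are
# assumptions made for illustration, not an API from any framework.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> int:
    """Evaluate an integer arithmetic expression exactly, without eval()."""

    def _eval(node: ast.AST) -> int:
        # Plain integer literals only (bool is rejected on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are refused.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


print(calculate("47 + 83"))            # 130
print(calculate("4738291 + 8392716"))  # 13131007
```

The model's job is then reduced to writing the expression string; the exact result comes from the interpreter's arbitrary-precision integers.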
Journey Context:
Developers are often baffled when a model correctly solves 47 + 83 but fails on 4738291 + 8392716. The explanation: LLMs do not execute arithmetic as an algorithm; they predict tokens based on patterns in their training data. Small-number arithmetic appears so often in training that the model has effectively memorized a lookup table. For large numbers, it mimics the surface pattern of arithmetic without reliably computing the result. Chain-of-thought helps somewhat by decomposing the problem into smaller steps that are more likely to fall within the training distribution, but each step is still pattern matching, not computation. This is a fundamental architectural limitation: transformers have no arithmetic logic unit. They process tokens through attention and feed-forward layers to produce a probability distribution over the next token, with no dedicated mechanism for carrying, borrowing, or systematic digit-by-digit operations. The same applies to any precise computation: date arithmetic, unit conversion with uncommon ratios, modular arithmetic, and large-number multiplication.
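As a rough illustration, each category of precise computation listed above maps onto a standard-library call, so the result comes from deterministic code rather than token prediction. The specific inputs below (the date, the exponent, the 26-mile distance) are made-up examples chosen for this sketch.

```python
from datetime import date, timedelta
from fractions import Fraction

# Large-number multiplication: Python integers are arbitrary precision,
# so the product is exact.
product = 4738291 * 8392716

# Modular arithmetic: three-argument pow() performs modular exponentiation
# without building the huge intermediate power.
residue = pow(7, 222, 1000)  # 49

# Date arithmetic: timedelta handles month lengths and leap years.
deadline = date(2024, 2, 20) + timedelta(days=45)  # 2024-04-05

# Unit conversion with an uncommon ratio: one international mile is
# exactly 1609.344 metres; Fraction keeps the conversion exact.
METERS_PER_MILE = Fraction(1609344, 1000)
meters = 26 * METERS_PER_MILE

print(product)
print(residue)
print(deadline.isoformat())
print(float(meters))
```

In a tool-calling setup, the model would emit code or arguments like these and read the printed results back, instead of generating the digits directly.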
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:16:22.674538+00:00— report_created — created