Report #75940
[counterintuitive] Why does the model fail at arithmetic on large or unfamiliar numbers, even with chain-of-thought prompting?
Always delegate arithmetic and numerical computation to a code execution tool (Python interpreter, calculator API). Never trust LLM output for arithmetic on numbers outside the common training distribution, regardless of model size or prompting strategy. This includes multiplication of numbers with more than three digits, division producing non-integer results, and any computation requiring carry operations.
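As an illustration of the delegation described above, here is a minimal sketch of a calculator tool that a model could call instead of producing the digits itself. The function name `evaluate_arithmetic` and the restriction to integer literals with `+`, `-`, `*`, `/` are assumptions made for this example, not part of the report. It parses the expression with Python's `ast` module rather than calling `eval`, and returns exact results as `Fraction` values so that non-integer division is not rounded.

```python
import ast
import operator
from fractions import Fraction

# Hypothetical whitelist for this sketch: only these operators are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> Fraction:
    """Recursively evaluate a whitelisted arithmetic AST node."""
    # Integer literals only; bool is excluded because it subclasses int.
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_arithmetic(expression: str) -> Fraction:
    """Evaluate an integer arithmetic expression exactly (assumed tool entry point)."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)
```

With this sketch, `evaluate_arithmetic("3847 * 9281")` returns `Fraction(35704007, 1)` and `evaluate_arithmetic("10 / 4")` returns `Fraction(5, 2)`. Anything outside the whitelist (names, function calls, attribute access) raises `ValueError`, which is the reason for walking the AST instead of passing model-generated text to `eval`.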
Journey Context:
The common belief is that larger models or better prompting will eventually solve arithmetic. In reality, LLMs perform arithmetic by pattern matching against training data, not by executing algorithms. They can reliably compute 7 × 8 = 56 because they have seen it millions of times, but they fail on 3847 × 9281 because they have not memorized that specific answer and cannot execute the carry-and-multiply algorithm. Chain-of-thought helps slightly by breaking problems into smaller, more memorizable steps, but it does not give the model the ability to actually compute; it only provides more chances to pattern-match correctly. Each step in a CoT arithmetic chain carries its own error probability, and those errors compound across the chain. This is a fundamental limitation: autoregressive transformers do next-token prediction over learned distributions, not symbolic computation. The fix is not better prompting. It is tool use.
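The compounding effect can be made concrete with a rough back-of-the-envelope model. The sketch below assumes, purely for illustration, that every step in the chain succeeds independently with the same probability; the 0.99 figure in the usage note is a made-up value, not a measurement, and both function names are hypothetical. Under that assumption, the chance that the whole chain is correct is the per-step accuracy raised to the number of steps.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def schoolbook_digit_products(a: int, b: int) -> int:
    """Count single-digit multiplications in schoolbook long multiplication.

    This is a lower bound on the steps in a worked chain: it ignores
    the carries and the additions of partial products.
    """
    return len(str(abs(a))) * len(str(abs(b)))
```

For 3847 × 9281, `schoolbook_digit_products` gives 16 digit products. With an assumed per-step accuracy of 0.99, `chain_success_probability(0.99, 16)` comes out at roughly 0.85, so even a model that is right 99% of the time per step would get about one such multiplication in seven wrong, before counting carries and additions.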
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:03:42.447175+00:00— report_created — created