Report #52421
[counterintuitive] Why does the model fail at arithmetic on large numbers even with chain-of-thought prompting?
Use a code interpreter, calculator tool, or other external computation for any arithmetic on numbers with 4+ digits. Chain-of-thought improves the structure of the reasoning but does not give the model a working arithmetic unit: it still cannot reliably break multi-digit numbers into individual digits and propagate carries across them.
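As an illustration of the calculator-tool approach, here is a minimal sketch in Python. The function name `calculate` and the set of supported operators are assumptions made for this example, not part of any particular tool-calling API. The model would emit the expression as text and the host application would evaluate it exactly, without calling `eval()` on model output.

```python
import ast
import operator

# Hypothetical tool: the operator whitelist below is an illustrative choice.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def calculate(expression: str) -> int:
    """Evaluate an integer expression using +, -, *, //, % exactly."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only plain integer literals (bools and floats are rejected).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("3847 * 2938"))  # 11302486
```

Python integers have arbitrary precision, so the result stays exact however many digits the operands have. Anything outside the whitelist (names, function calls, attribute access) raises `ValueError` instead of being executed.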
Journey Context:
Developers see models solve math competition problems with CoT and assume arithmetic is solved. But there is a critical distinction between mathematical reasoning (choosing the right operation) and arithmetic computation (actually executing 3847 × 2938). LLM tokenizers often split numbers into opaque chunks: depending on the tokenizer, '3847' may be a single token or an arbitrary multi-digit split. The model therefore has no reliable way to decompose it into 3, 8, 4, 7 and perform digit-by-digit carry arithmetic. Instead it does arithmetic largely by pattern-matching on results seen in training, which works for common small numbers but degrades rapidly with magnitude and with uncommon operands. CoT helps the model show its work, but each individual computation step is still subject to the same tokenization-induced errors. Tools such as OpenAI's Code Interpreter take the same view: the computation is delegated to executed code rather than to the model's own token predictions. The mental model: LLMs are reasoners, not calculators. They can plan the computation but cannot reliably execute it.
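To make concrete what "digit-by-digit carry arithmetic" requires, the sketch below implements schoolbook long multiplication on digit strings in Python. The function name `long_multiply` is hypothetical. The code illustrates the procedure a calculator executes deterministically; it is not a description of what happens inside a model.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook multiplication of non-negative decimal digit strings."""
    # Work least-significant digit first so index i means the 10**i place.
    x = [int(d) for d in reversed(a)]
    y = [int(d) for d in reversed(b)]
    result = [0] * (len(x) + len(y))

    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10   # digit that stays in this place
            carry = total // 10          # amount pushed to the next place
        result[i + len(y)] += carry      # final carry of this row

    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0"


print(long_multiply("3847", "2938"))  # 11302486
```

Every inner-loop step depends on the exact digit and carry produced by the previous step, so a single wrong digit corrupts the final answer. That is why the execution step belongs in code once the model has chosen the operation.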
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:29:06.197320+00:00— report_created — created