Report #91645
[counterintuitive] Why do LLMs fail at multi-digit arithmetic, and can more training data or better prompting fix it?
Always delegate arithmetic, mathematical computation, and any algorithmic operation to a code interpreter or calculator tool. Never trust direct model text output for numerical computation beyond simple single-digit operations.
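As an illustration of what delegating to a calculator tool can look like, here is a minimal sketch of an exact integer calculator that a model could call instead of producing digits itself. The function name `calculate` and the operator whitelist are assumptions made for this example and are not part of any particular tool-calling API. It parses the expression with Python's `ast` module rather than calling `eval()`.

```python
import ast
import operator

# Hypothetical whitelist: only these operators are accepted by the sketch.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> int:
    """Evaluate an integer arithmetic expression exactly, without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain integer literals only (bool and float are rejected).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access and anything else are refused.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("3847 * 2956"))
```

Because Python integers are arbitrary precision, the result is exact for operands of any length, which is the property the model's text output cannot guarantee.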
Journey Context:
The common belief is that arithmetic failures stem from training data or prompt engineering, so more math examples or better chain-of-thought prompting should fix them. This misunderstands what autoregressive models are: next-token predictors operating on learned patterns, not algorithmic executors. Multi-digit multiplication (e.g., 3847 × 2956) requires a specific sequence of carry operations, every one of which must be executed perfectly; there is no 'approximately correct' in arithmetic. The model fails for three reasons: (1) BPE tokenization splits numbers unpredictably (3847 might be one token or two), breaking the digit-by-digit structure the algorithm requires. (2) Each intermediate carry step must be correct; a single error propagates and invalidates the result. (3) The model is pattern-matching against arithmetic it has seen, not computing. Even GPT-4, with massive math training data, fails on novel multi-digit arithmetic without code execution. This is why tool-augmented models (with code interpreters) show dramatic math performance gains: the computation is offloaded to an actual algorithmic engine.
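To make the carry argument concrete, the following is a small sketch of schoolbook long multiplication written out digit by digit. The `corrupt_carry_at` parameter is a hypothetical fault-injection knob added only for this illustration: it drops the carry at one single-digit step, mimicking one wrong intermediate step. It is not a model of how an LLM actually computes.

```python
def schoolbook_multiply(a: int, b: int, corrupt_carry_at=None) -> int:
    """Multiply two non-negative integers digit by digit with explicit carries.

    corrupt_carry_at (hypothetical, for illustration): index of the
    single-digit step whose carry is dropped.
    """
    # Least significant digit first, as in pencil-and-paper multiplication.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))

    step = 0
    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
            if step == corrupt_carry_at:
                carry = 0  # the single injected mistake
            step += 1
        # The final carry of this row lands one position past the row.
        result[i + len(b_digits)] += carry

    return int("".join(str(d) for d in reversed(result)))


exact = schoolbook_multiply(3847, 2956)
faulty = schoolbook_multiply(3847, 2956, corrupt_carry_at=0)
print(exact, faulty, exact == faulty)
```

With no fault injected, the function agrees with Python's built-in `3847 * 2956`. Dropping the carry at the very first step (7 × 6 = 42) is enough to make the final product wrong, even though every other step is executed correctly.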
Lifecycle
2026-06-22T12:25:05.944667+00:00— report_created — created