Report #79716
[counterintuitive] LLM produces wrong results for multi-digit multiplication and long addition despite chain-of-thought
Always delegate multi-digit arithmetic to code execution or a calculator tool. Never trust the model's direct text output for any operation requiring carry propagation across more than 2-3 digits, regardless of how detailed the chain-of-thought prompt is.
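A minimal sketch of what a calculator tool could look like, assuming the model emits a plain integer expression as a string and the host application evaluates it. The name `calculate` and the operator whitelist are illustrative assumptions, not part of any particular tool-calling framework. The expression is parsed with the standard-library `ast` module and only whitelisted nodes are evaluated, so arbitrary model output is never passed to `eval`.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> int:
    """Evaluate an integer arithmetic expression exactly (hypothetical tool entry point)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept integer literals only; bools and floats are rejected.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("347 * 892"))  # 309524
```

Python integers are arbitrary precision, so the result stays exact however many digits the operands have. The model's job is reduced to producing the expression, and the interpreter does the carrying.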
Journey Context:
Multi-digit arithmetic requires a specific algorithmic procedure, carry propagation, executed with perfect precision across many serial steps. LLMs generate tokens autoregressively, and the only working memory that persists from one step to the next is the text already emitted. When a model appears to multiply 347 × 892 directly, it is largely pattern-matching against similar computations seen in training rather than executing the algorithm. Chain-of-thought helps with simple cases by forcing intermediate steps into the output, but breaks down on larger numbers because: (1) there is no hidden scratchpad, so any carry that is not written out explicitly has to be recomputed implicitly at every digit position; (2) generation runs left to right and already-emitted tokens cannot be revised, whereas carries flow from the least significant digit upward, so a leading digit may have to be committed before the carries that determine it appear in the context; (3) a single digit error propagates and invalidates the entire result. Better prompting does not reliably fix this, because autoregressive token generation has no mechanism that guarantees exact serial, state-dependent computation: every step carries some chance of error, and the chance of a fully correct result falls as the number of steps grows. The PAL (Program-Aided Language Models) paper showed that having the model write executable code for the arithmetic and running it in an interpreter is more accurate than having the model compute the answer in text.
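To make the serial dependence concrete, here is a small illustrative sketch of schoolbook addition that records the carry entering each digit position. The function name `add_with_carries` is made up for this example. Each step needs the carry produced by the previous step, which is exactly the state a text-only computation has to track without error.

```python
from typing import List, Tuple


def add_with_carries(a: str, b: str) -> Tuple[str, List[int]]:
    """Add two non-negative decimal strings digit by digit.

    Returns the sum as a string and the carry entering each digit
    position, least significant position first.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    carries = []
    # Walk from the least significant digit to the most significant one.
    for da, db in zip(reversed(a), reversed(b)):
        carries.append(carry)
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10  # state handed to the next position
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits)), carries


total, carries = add_with_carries("999999", "1")
print(total)    # 1000000
print(carries)  # [0, 1, 1, 1, 1, 1]
```

In 999999 + 1 the carry ripples through every position, so the leading digit of the answer depends on the very last step of the computation. A program handles this trivially; a left-to-right token generator has to get it right before any of that work is visible in its output.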
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:24:29.556817+00:00— report_created — created