Report #88729
[counterintuitive] LLM produces incorrect results for multi-digit arithmetic even with chain-of-thought prompting
Use code execution or a calculator tool for all arithmetic beyond simple single-digit operations. Do not rely on chain-of-thought or step-by-step prompting to make arithmetic reliable. Treat arithmetic as a tool call, not a reasoning task.
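A minimal sketch of what "arithmetic as a tool call" could look like, assuming the surrounding agent framework can register a Python function as a tool. The function name `evaluate_arithmetic` and the set of allowed operators are illustrative assumptions, not part of the original report. The model would emit only the expression string, and the host process would compute the exact result.

```python
import ast
import operator

# Hypothetical tool: the name and the allowed operator set are assumptions
# made for illustration. Only integer arithmetic is supported.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def evaluate_arithmetic(expression: str) -> int:
    """Evaluate an integer arithmetic expression exactly, without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain integer literals only (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Names, calls, attribute access, and everything else are rejected.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    print(evaluate_arithmetic("123456789 + 987654321"))
```

Walking the parsed syntax tree instead of calling `eval()` keeps the tool limited to arithmetic, so a model-supplied string such as a function call raises `ValueError` rather than executing.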
Journey Context:
Multi-digit addition and multiplication require processing from the least significant digit to the most significant digit (right-to-left carry propagation), but autoregressive models generate tokens left-to-right. The model must therefore predict the most significant digit before it has computed the carry from the less significant digits. Chain-of-thought partially works around this by having the model write out the right-to-left process in natural language, but this is a brittle simulation of an algorithm the architecture does not natively execute. The model is essentially pattern-matching against memorized arithmetic examples rather than computing, which is why accuracy degrades sharply on numbers that are not well represented in the training data. This is not a prompt engineering problem; it is a directionality mismatch between autoregressive generation and the algorithmic requirements of carry-propagation arithmetic.
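The carry dependency described above can be illustrated with a short sketch of schoolbook addition on decimal strings. The function name `add_decimal_strings` is an assumption chosen for this example. The loop consumes digits right-to-left, and the leading digit of the result is only known once every carry has been resolved.

```python
def add_decimal_strings(a: str, b: str) -> str:
    """Schoolbook addition of two non-negative decimal strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    digits = []  # collected least significant digit first
    for da, db in zip(reversed(a), reversed(b)):
        # Each column needs the carry produced by the column to its right.
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))

    # Reverse at the end to get the usual most-significant-first order.
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(add_decimal_strings("4999", "5000"))  # 9999
    print(add_decimal_strings("4999", "5001"))  # 10000
```

Changing only the last digit of the second operand changes both the first digit and the length of the result, which is exactly the information a left-to-right generator has to commit to before it has processed the low-order digits.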
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:31:00.713194+00:00— report_created — created