Report #85898
[counterintuitive] Why does chain-of-thought fail on large-number multiplication even when the model shows its work step by step?
Use code execution for any arithmetic beyond basic single-digit operations. Chain-of-thought improves the structure of the model's reasoning, but it cannot substitute for a calculator. If the task requires exact multi-digit computation, the model must call an external tool; this is not a problem that prompting can solve.
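A minimal sketch of what such a tool could look like, assuming a Python code-execution environment is available to the model. The function name `multiply_exact` and its string-in, string-out interface are hypothetical, chosen for illustration and not taken from any particular tool-calling API. It relies on Python's built-in arbitrary-precision integers, so the product is exact regardless of operand length.

```python
def multiply_exact(a: str, b: str) -> str:
    """Return the exact product of two decimal integer strings.

    Hypothetical tool body: operands are passed as strings so that no
    digits are lost to float rounding before the multiplication happens.
    """
    # int() raises ValueError if either operand is not a valid integer
    # literal, which surfaces malformed tool arguments instead of guessing.
    return str(int(a) * int(b))
```

For example, `multiply_exact("123456789", "987654321")` returns `"121932631112635269"`. The model's job is then reduced to choosing the operands and reading back the result, not producing the digits itself.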
Journey Context:
Developers add 'think step by step', see improvement on simple math, and then assume the approach scales to arbitrary precision. It doesn't. Multi-digit multiplication requires tracking carries across digit positions: a serial, stateful computation in which each step depends on the exact result of the previous one. LLMs approximate these operations through pattern recognition learned from training data, not by executing the computation. For numbers within the training distribution, the model may recall the answer, but for novel large numbers, a small error at any step propagates into every later step, so the chance of a fully correct result shrinks as the operands grow. Research on compositional arithmetic reports that model performance degrades systematically as the number of reasoning steps increases, following a predictable curve; this is not noise but a structural limit. More steps, more tokens, or bigger models don't fix it, because the architecture doesn't implement the carry-and-add algorithm; it approximates it statistically. The fundamental mismatch is between exact serial computation and statistical pattern matching.
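To make the serial dependence concrete, here is a rough sketch of schoolbook multiplication with an explicit carry, written to count how many single-digit steps the computation needs. The function names and the independence assumption in `chance_all_steps_correct` are illustrative choices, not taken from the report or from any cited study, and the 0.99 per-step accuracy used below is a made-up figure, not a measurement.

```python
from typing import Tuple


def schoolbook_multiply(a: str, b: str) -> Tuple[str, int]:
    """Multiply two non-negative decimal strings digit by digit.

    Returns the product and the number of single-digit multiply-add steps.
    """
    x = [int(d) for d in reversed(a)]  # least significant digit first
    y = [int(d) for d in reversed(b)]
    result = [0] * (len(x) + len(y))
    steps = 0
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            # Each step needs the exact digit already stored at this
            # position and the exact carry from the previous step.
            total = result[i + j] + dx * dy + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1
        result[i + len(y)] += carry
    digits = "".join(str(d) for d in reversed(result)).lstrip("0")
    return digits or "0", steps


def chance_all_steps_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step is right, ASSUMING independent errors."""
    return per_step_accuracy ** steps
```

Multiplying two 9-digit numbers this way takes 81 such steps. Under the simplifying assumption above, a hypothetical 99% per-step accuracy leaves roughly a 44% chance that all 81 steps are correct, which illustrates why errors that look negligible on small inputs dominate on large ones.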
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:46:08.355591+00:00— report_created — created