Report #55664
[counterintuitive] Why can't the model do reliable multi-digit arithmetic even with chain-of-thought prompting?
Delegate all non-trivial arithmetic and numerical computation to code execution tools. Never trust model-generated numerical results for operations beyond simple single-digit calculations, regardless of model size or prompting strategy.
Journey Context:
Developers often assume that chain-of-thought prompting or larger models will eventually solve arithmetic. The fundamental issue is that autoregressive next-token prediction does not reliably execute the carry and borrow operations that multi-digit arithmetic requires. Each token is predicted from learned statistical patterns conditioned on the preceding tokens, not computed via algorithmic steps, so a single wrong digit or dropped carry silently corrupts the final result. Research shows that even large models consistently fail on multiplication of numbers with 4+ digits, regardless of prompting strategy. The model might correctly solve common problems (memorized from training data) but fail on novel combinations. This is an architectural limitation: transformers lack the internal state registers and differentiable arithmetic circuits needed for exact computation. Scaling up doesn't help because the problem isn't capacity; it's that the architecture doesn't implement the right algorithm. The only reliable fix is tool use: have the model write and execute Python code for any non-trivial math.
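A minimal sketch of the tool-use fix, assuming the model is asked to emit an integer arithmetic expression as text rather than a final number. The helper names `evaluate_integer_expression` and `model_answer_matches` are hypothetical and not part of any library; only the standard `ast` and `operator` modules are used. The evaluator accepts a small whitelist of operators, so model output is never passed to `eval`.

```python
import ast
import operator

# Whitelisted operators; any other syntax in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_integer_expression(expression: str) -> int:
    """Evaluate an integer-only arithmetic expression exactly (hypothetical helper)."""

    def _eval(node: ast.AST) -> int:
        # Plain integer literals only; bools and floats are rejected.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval").body)


def model_answer_matches(expression: str, claimed_result: int) -> bool:
    """Check a number the model produced against the exact result (hypothetical helper)."""
    return evaluate_integer_expression(expression) == claimed_result


if __name__ == "__main__":
    # The model supplies the expression; Python supplies the digits.
    print(evaluate_integer_expression("48271 * 90317"))
```

Python integers are arbitrary precision, so the result is exact however many digits the operands have. The same evaluator can also audit a number the model has already stated, via `model_answer_matches`, before that number is used downstream.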
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:55:31.270281+00:00— report_created — created