Report #45744
[counterintuitive] Why can't the model reliably multiply two 4-digit numbers even with step-by-step chain-of-thought?
Delegate all non-trivial arithmetic to code execution or calculator tools. Never rely on the model to perform multi-digit arithmetic, even with step-by-step prompting. Use tool-calling patterns where the model writes the expression and a runtime evaluates it.
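One way to sketch the "model writes the expression, runtime evaluates it" pattern in Python is shown here. The function name `evaluate_expression` and the operator whitelist are illustrative assumptions, not part of any particular tool-calling API. The model would emit a string such as "347 * 892" as a tool argument, and the host application would run something like this instead of trusting digits produced by the model.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch). Exponentiation is
# left out on purpose: huge exponents could exhaust memory or CPU.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed node, rejecting anything not whitelisted."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_expression(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    # The model supplies only the expression text; Python does the arithmetic.
    print(evaluate_expression("347 * 892"))    # 309524
    print(evaluate_expression("4821 * 7396"))  # 35656116
```

Parsing with `ast` rather than calling `eval()` keeps model-written text from executing arbitrary code: names, function calls, and attribute access all raise `ValueError`.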
Journey Context:
Developers are often surprised that models fail at arithmetic a calculator handles trivially. The issue is not intelligence but representation. LLMs represent numbers as tokens (often multi-digit chunks such as '42' or '314', depending on the tokenizer) and must perform arithmetic through sequential text prediction, with no dedicated arithmetic logic unit. When a model 'calculates' 347 × 892, it is applying patterns learned from training data and emitting the answer one token at a time. Each predicted token has some chance of being wrong, and because every later token is conditioned on the earlier ones, a single wrong digit or carry propagates through the rest of the computation. A computer stores an integer n exactly in O(log n) bits and operates on it with a deterministic algorithm; a model holds intermediate values in fixed-size, finite-precision activations with no guarantee of exactness. Chain-of-thought helps somewhat by decomposing the problem into smaller steps, each with a lower error probability, but the number of steps grows with operand length, so the fundamental issue remains: autoregressive text generation is not an arithmetic circuit. Larger models tend to be more accurate on arithmetic that resembles their training data, but scale alone does not make the computation exact.
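The compounding effect can be illustrated with a rough back-of-the-envelope model. The sketch below assumes every step (a digit, carry, or partial product) succeeds with the same hypothetical probability and that failures are independent. Both are simplifications: real token errors are correlated, and the accuracy figures and step counts here are placeholders, not measurements of any model.

```python
def chance_all_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, under the independence assumption."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Hypothetical per-step accuracies and step counts, for illustration only.
    for accuracy in (0.99, 0.999):
        for steps in (8, 30, 60):
            overall = chance_all_correct(accuracy, steps)
            print(f"accuracy={accuracy}, steps={steps}: all correct {overall:.1%}")
```

Even under these generous assumptions, a long chain of individually reliable steps gives a clearly lower chance of a fully correct result, which is why decomposition alone does not close the gap.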
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:15:30.924858+00:00— report_created — created