Report #26347
[counterintuitive] Model produces incorrect arithmetic results, especially with large numbers, decimals, or multi-step calculations
Always delegate arithmetic to code execution. Use Python's arithmetic operators, math library, or decimal module for any numerical computation. Never trust model-computed numbers for anything beyond trivial single-digit operations. If a task involves computing a value, write code to compute it; don't ask the model to do mental math, even with chain-of-thought.
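As a minimal sketch of what delegating to code can look like, the snippet below covers the three problem cases named in the title: large numbers, decimals, and a multi-step calculation. The function name `compound_balance` and the sample figures are illustrative assumptions, not part of the original report.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Large numbers: Python ints are arbitrary precision, so the product is exact.
left = 123456789123456789
right = 987654321987654321
big_product = left * right

# Exact integer square root from the math library (no float rounding).
root = math.isqrt(10**40)

# Decimals: build Decimal values from strings to avoid binary float error.
decimal_sum = Decimal("0.1") + Decimal("0.2")


def compound_balance(principal: Decimal, annual_rate: Decimal, years: int) -> Decimal:
    """Hypothetical multi-step example: yearly compounding, rounded to cents."""
    balance = principal * (Decimal(1) + annual_rate) ** years
    return balance.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


balance = compound_balance(Decimal("1000"), Decimal("0.05"), 3)
print(big_product, root, decimal_sum, balance)
```

The rounding mode is set explicitly because the decimal module defaults to round-half-even, which may not match what a financial or reporting context expects.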
Journey Context:
LLMs are next-token predictors, not calculators. They approximate arithmetic by pattern matching against training data. For small, common calculations (2+2=4, 100*50=5000), the pattern is well-represented in training data and the model gets it right. For anything involving large numbers, non-integer values, or multi-step computation, the model is essentially guessing based on statistical patterns. This doesn't improve reliably with model scale: large models still fail on novel arithmetic because the architecture has no internal ALU and no mechanism for exact numerical computation. Chain-of-thought helps by decomposing computation into smaller steps (each more likely to appear in training data), but each step still carries error risk and errors compound across steps. The Toolformer paper demonstrated that even small models with tool access outperform large models without tools on arithmetic tasks. The fix is unequivocal: externalize all non-trivial computation to code.
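One way to externalize computation is to give the model a calculator tool: the model emits an expression as text and code evaluates it. The following is a sketch under stated assumptions, not the Toolformer implementation. The name `evaluate_arithmetic` is hypothetical, and the evaluator walks Python's `ast` instead of calling `eval` so that only plain arithmetic is accepted.

```python
import ast
import operator

# Whitelisted operators. Exponentiation is deliberately left out, because
# huge exponents could make evaluation arbitrarily expensive.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_arithmetic(expression: str):
    """Evaluate a plain arithmetic expression (hypothetical tool entry point)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int and float literals; bools and strings are rejected.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


print(evaluate_arithmetic("100 * 50"))
print(evaluate_arithmetic("(17 + 3) * -2"))
```

Names, function calls, and attribute access all raise `ValueError`, so model-generated text cannot reach anything beyond the whitelisted operators. Results involving `/` are floats; where exact decimal results matter, the same structure could be adapted to construct `Decimal` values instead.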
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:37:25.648039+00:00— report_created — created