Report #91094
[counterintuitive] Model makes arithmetic errors — needs better prompting or more chain-of-thought steps
Never rely on LLMs for precise arithmetic, data manipulation, or computation. Always route these operations to code execution (Python interpreter, calculator tool). Use the LLM to formulate what to compute, not to compute it. Code interpreter / tool use is not a workaround; it is the correct architecture.
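A minimal sketch of this split, assuming the model has been prompted to return a plain arithmetic expression as a string instead of a final number. The name `evaluate_expression` and the operator whitelist are hypothetical choices for illustration, not part of any particular tool-use API. The expression is parsed with Python's standard `ast` module and evaluated node by node, so nothing other than numbers and basic operators is ever executed.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    # Numeric literals only; booleans are excluded even though bool subclasses int.
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_expression(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


# The model supplies the expression; the interpreter supplies the answer.
print(evaluate_expression("2347 * 3891"))  # 9132177
```

Function calls, names, and attribute access all raise `ValueError`, so a malformed or adversarial model output fails loudly instead of running. A real deployment would more likely use a sandboxed interpreter or the provider's own tool-calling interface; this sketch only shows where the boundary sits.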
Journey Context:
Developers see arithmetic errors and assume the model just needs more scratchpad space or better step-by-step prompting. But autoregressive language models are not calculators: they predict the next most likely token rather than executing an algorithm. A model might correctly predict that 2347 × 3891 starts with '9' because it has seen similar patterns, but it has no dedicated mechanism for carrying, borrowing, or the other systematic operations that exact arithmetic requires. Chain-of-thought can help with simple operations by breaking them into smaller, more predictable steps, but it does not give the model a computational architecture. The errors are not random noise; they are the expected output of a system doing something fundamentally different from computation. Scaling up model size improves pattern matching on common arithmetic but does not confer the ability to compute algorithmically.
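To make the contrast concrete, here is a sketch of the schoolbook algorithm that exact multiplication relies on, with the carry handled explicitly at every digit. The function name `long_multiply` is hypothetical and the code is illustrative only; in practice Python's built-in integer multiplication already does this exactly. The point is that every output digit depends on carries propagated from lower digits, which is deterministic bookkeeping and not pattern prediction.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers digit by digit, with explicit carries."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # Store digits least significant first so index i means 10**i.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))

    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10   # digit that stays in this column
            carry = total // 10          # amount pushed to the next column
        result[i + len(b_digits)] += carry

    return int("".join(str(d) for d in reversed(result)))


# Agrees with the interpreter's exact integer arithmetic.
print(long_multiply(2347, 3891))  # 9132177
print(2347 * 3891)                # 9132177
```

The leading '9' is the only digit a reader could plausibly guess from magnitude alone; the remaining digits come out of the carry chain.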
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:29:49.771986+00:00— report_created — created