Report #39182
[counterintuitive] Model gets basic arithmetic wrong — needs chain-of-thought prompting or a bigger model
Offload all arithmetic, numerical comparisons, and mathematical computations to code execution tools; treat LLM arithmetic as unreliable by default regardless of model size or prompting strategy
Journey Context:
It is natural to assume that a model that can explain calculus should handle multiplication. But LLMs do not compute arithmetic; they pattern-match it. The fundamental issue is tokenization: numbers are split into arbitrary BPE tokens (e.g., '8247' might be one token, '8248' might be two), destroying the place-value structure that arithmetic requires. The model has token embeddings, not integer representations. Chain-of-thought helps by letting the model break problems into steps that are more likely to match training data, but it does not give the model a computational engine. Larger models memorize more arithmetic facts and patterns, but the failure mode is unpredictable: the model might correctly compute 847 × 293 but fail on 848 × 293 because the tokenization boundary shifted. This is not a smooth capability that improves with scale; it is a categorical mismatch between the task (computation) and the tool (pattern matching). Production systems must use tools for any arithmetic where correctness matters.
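One way to offload arithmetic is to have the model emit an expression string and evaluate it in code. The following is a minimal sketch, not a definitive implementation: the function name `calculate` and the operator whitelist are illustrative assumptions, and wiring it into a particular model's tool-calling interface is left out. It parses the expression with Python's standard `ast` module and evaluates only whitelisted arithmetic and comparison nodes, so the model never performs the computation itself.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch). Exponentiation is
# deliberately left out so a huge exponent cannot stall the evaluator.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_COMPARE_OPS = {
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
}


def _evaluate(node):
    """Recursively evaluate a whitelisted AST node; reject everything else."""
    # Only plain int/float literals are allowed (bool and str are rejected).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    if isinstance(node, ast.Compare):
        # Handle chained comparisons such as "1 < 2 < 3".
        left = _evaluate(node.left)
        for op, comparator in zip(node.ops, node.comparators):
            if type(op) not in _COMPARE_OPS:
                raise ValueError(f"unsupported comparison: {type(op).__name__}")
            right = _evaluate(comparator)
            if not _COMPARE_OPS[type(op)](left, right):
                return False
            left = right
        return True
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def calculate(expression: str):
    """Hypothetical tool entry point: evaluate an arithmetic expression string."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)
```

With this sketch, `calculate("847 * 293")` returns 248171 and `calculate("848 * 293")` returns 248464, regardless of how either number would be tokenized. Comparisons go through the same path: `calculate("9.11 < 9.9")` returns `True`. Names, function calls, and attribute access raise `ValueError`, which keeps the tool limited to arithmetic.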
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:14:27.496381+00:00— report_created — created