Report #85259
[counterintuitive] The model just needs better prompting or chain-of-thought to do reliable arithmetic and numerical calculations.
Always delegate arithmetic, numerical calculations, and any computation requiring exact precision to a code interpreter or calculator tool. Never rely on the LLM's native arithmetic, regardless of model size or prompt sophistication.
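As one illustration of the recommendation above, here is a minimal sketch of a calculator tool an agent could call instead of doing the arithmetic itself. The function name `calculate` and the choice to support only integer literals with `+ - * / // %` are assumptions made for this example; they are not part of any particular agent framework. It walks the parsed syntax tree instead of calling `eval()`, and it returns exact rationals via `fractions.Fraction`.

```python
import ast
import operator
from fractions import Fraction

# Hypothetical tool for illustration only; the names are not from any framework.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> Fraction:
    """Evaluate a plain arithmetic expression exactly, without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept integer literals only (bool is a subclass of int, so exclude it).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)
        ):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, floats, etc. are rejected outright.
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))
```

With this sketch, the agent would pass a string such as `"3999 + 1"` or `"1/3 + 1/6"` to `calculate` and use the returned value verbatim instead of generating the digits itself. Exponentiation is left out on purpose so that a single call cannot request an enormous result.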
Journey Context:
Developers see GPT-4 solve 2+2 and assume it can do any arithmetic with the right prompting. But autoregressive language models generate tokens left-to-right, while schoolbook arithmetic such as addition proceeds right-to-left because of carry propagation. When adding 3999 + 1, a human carries right-to-left to get 4000; an autoregressive model must emit the answer left-to-right, predicting the leading '4' before it has written out the carry from the rightmost digit. This is not a prompt problem: it is a mismatch between the model's generation order and the order the algorithm requires. Chain-of-thought helps by letting the model break the computation into smaller steps, but every step is still generated left-to-right, so errors can still slip in and compound over a long calculation. Larger models get better at pattern-matching common arithmetic (they have likely seen 3999+1=4000 in training data) but remain unreliable on novel computations. In those cases the model is not executing an algorithm; it is pattern-matching familiar results. For coding agents, this means any numerical computation, even a simple one, should go through a code execution tool. A single arithmetic error in an index calculation or offset computation can cascade into completely wrong code. The GPT-4 technical report itself documents persistent arithmetic limitations despite massive scale.
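The carry argument above can be made concrete with a small sketch of schoolbook addition on decimal strings. The helper name `add_digits_right_to_left` is hypothetical and exists only for this illustration. The loop has to start at the least significant digit, and the leading digit of the result is not known until the last carry has been resolved.

```python
def add_digits_right_to_left(a: str, b: str) -> str:
    """Schoolbook addition on non-negative decimal strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros
    carry = 0
    digits = []  # collected least significant digit first
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


# Changing only the LAST digit of an operand flips the FIRST digit of the sum.
print(add_digits_right_to_left("3998", "1"))  # 3999
print(add_digits_right_to_left("3999", "1"))  # 4000

# In an execution tool, Python's built-in integers already do this exactly.
print(3999 + 1)  # 4000
```

The two calls differ only in the rightmost input digit, yet the first output digit changes from 3 to 4. A code execution tool resolves the whole carry chain before producing any output, which is why delegating the computation sidesteps the ordering problem.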
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:41:49.437173+00:00 - report_created - created