Report #76227
[counterintuitive] LLM gets arithmetic wrong — needs better prompting or more reasoning steps to fix
Use code execution (Python interpreter, calculator tool) for any arithmetic beyond simple single-digit operations. Chain-of-thought helps with choosing the right mathematical approach, not with executing the computation. Separate the reasoning (model) from the calculation (code). Always route numeric computation through a runtime.
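A minimal sketch of that separation, assuming the model is prompted to emit a plain arithmetic expression as text rather than a final number. The function name `evaluate_arithmetic` and the operator whitelist are illustrative choices, not part of any specific tool-calling API. The sketch accepts integer literals only and computes with exact rational arithmetic.

```python
import ast
import operator
from fractions import Fraction

# Whitelist of operators the evaluator will execute (an assumption of this sketch).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed expression, rejecting anything not whitelisted."""
    # Integer literals only; bools, floats, strings and names are rejected.
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_arithmetic(expression: str) -> Fraction:
    """Evaluate a model-produced arithmetic expression with exact arithmetic."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


# The model decides *what* to compute; the runtime does the computing.
print(evaluate_arithmetic("8473 * 3921"))
```

Walking the syntax tree instead of calling `eval` keeps model output from running arbitrary code. A production setup would more likely use a sandboxed interpreter or a dedicated calculator tool.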
Journey Context:
When a model computes 8473 × 3921 incorrectly, the reflex is to add 'think step by step' or more few-shot examples. A major root cause, however, is how numbers are tokenized: '8473' might be split as ['8','473'] or ['84','73'], depending on the tokenizer. The model does not see '8473' as a single mathematical object; it sees sub-word fragments whose boundaries do not line up with place value. On top of that, generation is autoregressive and left to right: the most significant digits of the answer are emitted first, although they depend on carries from the less significant digits, and a digit cannot be revised once it has been emitted. Together, these make precise multi-digit arithmetic unreliable. CoT can help the model decide to multiply rather than add, but it cannot make the model reliably compute 8473 × 3921 (the exact product is 33,222,633), because tokenization obscures the place-value structure that the schoolbook algorithm relies on. The computation must be externalized to a runtime that operates on actual numeric types.
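To make the place-value point concrete, here is a small sketch using the two example splits mentioned above. The splits are illustrative and not the output of any real tokenizer, and the helper names are hypothetical. It shows that a fragment's numeric weight depends on how many digits follow it, which the fragment itself does not carry.

```python
def fragment_place_values(fragments):
    """Pair each digit fragment with the power of ten it must be scaled by."""
    weighted = []
    digits_to_right = sum(len(fragment) for fragment in fragments)
    for fragment in fragments:
        # The weight is determined by the digits that come after this fragment.
        digits_to_right -= len(fragment)
        weighted.append((fragment, 10 ** digits_to_right))
    return weighted


def reconstruct(fragments):
    """Rebuild the integer from its weighted fragments."""
    return sum(int(fragment) * weight
               for fragment, weight in fragment_place_values(fragments))


for split in (["8", "473"], ["84", "73"]):
    print(split, fragment_place_values(split), reconstruct(split))
```

Both splits reconstruct to 8473, but the leading fragment carries a weight of 1000 in one case and 100 in the other. A runtime working on integer types never has to recover this structure from fragments.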
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:32:43.256982+00:00— report_created — created