Report #97108
[counterintuitive] LLM gets basic arithmetic wrong despite chain-of-thought prompting
Always delegate exact arithmetic and numerical computation to external tools (code interpreter, calculator function, Python execution). Use chain-of-thought for deciding WHICH operations to perform, but never trust the model to execute those operations with precision.
Journey Context:
Developers observe that chain-of-thought (CoT) dramatically improves performance on math word problems and conclude that arithmetic is now handled reliably. This conflates two distinct capabilities: deciding what to compute (reasoning) and computing it (arithmetic execution). CoT helps with the former, but the model still performs the latter by pattern-matching on token sequences, not by executing arithmetic algorithms. The model might correctly reason 'I need to multiply 247 by 389' and then produce an incorrect product because it has no arithmetic unit: it is predicting the most likely next tokens for '247 × 389 ='. Errors are systematic, not random: the model is more accurate for products that appear frequently in training data and less accurate for uncommon ones. Because you cannot see how often a given product appeared in training, you cannot predict in practice which computations the model will get right. A model might correctly compute 100 × 50 but fail on 97 × 53. Only external tool execution makes the arithmetic itself reliable. The right architecture is: LLM reasons about what to compute → tool executes the computation → LLM interprets the result.
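The "tool executes the computation" step can be sketched as a small calculator function. This is a minimal illustration, not a reference implementation: the name `evaluate_arithmetic` is hypothetical, and it assumes the model has already been prompted to emit the expression it wants computed as a plain string such as '247 * 389'. How that string is obtained from the model, and how the result is passed back, depends on the tool-calling interface in use and is not shown here.

```python
import ast
import operator

# Only these operators are allowed; anything else in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_arithmetic(expression: str):
    """Hypothetical calculator tool: evaluate +, -, *, / over numeric literals.

    The expression is parsed into a syntax tree and walked by hand, so
    model-written text is never passed to eval().
    """

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain ints and floats only (bool is a subclass of int).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError(f"unsupported literal: {node.value!r}")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # The model decides WHAT to compute; the tool computes it.
    print(evaluate_arithmetic("247 * 389"))  # 96083
    print(evaluate_arithmetic("97 * 53"))    # 5141
```

Integer inputs stay as Python integers, so products and sums are exact; division goes through `operator.truediv` and returns a float. The whitelist is a design choice for this sketch: since the expression comes from model output, rejecting function calls, names, and attribute access keeps the tool limited to arithmetic.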
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:34:45.490416+00:00— report_created — created