Report #38603
[counterintuitive] Why does the model get basic arithmetic wrong while handling seemingly complex reasoning?
Route all arithmetic and numerical computation to code execution or calculator tools. Never rely on the model's direct text generation for numerical results—regardless of model size, regardless of how simple the calculation seems.
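One way to follow this recommendation is to expose a small calculator tool that the application calls whenever the model needs a number. The sketch below is a minimal illustration, not a reference implementation: the function name `calculate` and the set of supported operators are assumptions made for this example. It walks Python's `ast` instead of calling `eval()`, so only plain arithmetic is accepted.

```python
import ast
import operator

# Whitelisted operators. Exponentiation is deliberately left out so a
# model-supplied expression cannot request an enormous power.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression (hypothetical tool entry point)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only int and float literals are allowed (bool, str, etc. are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access and everything else are refused.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # Integer arithmetic in Python is arbitrary precision, so this is exact.
    print(calculate("847291 * 293847"))
```

The model's job then shrinks to producing the expression string; the digits in the final answer come from the interpreter, not from token generation.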
Journey Context:
Developers often assume that because LLMs can write proofs and explain quantum mechanics, they can surely multiply 847291 × 293847. This is a category error: LLMs do not compute, they pattern-match. '2 + 2 = 4' appears in training data millions of times, so the model reproduces it reliably. A specific large-number product almost never appears, so the model generates a token sequence that looks like an answer (right-looking digit patterns, plausible magnitude) but is wrong. This is not a reasoning gap that scale closes: GPT-4 still fails on novel large-number arithmetic without code execution. The model has no arithmetic logic unit (ALU), meaning no dedicated mechanism for carrying, borrowing, or place-value manipulation. Chain-of-thought helps slightly by decomposing the problem into smaller pattern-matched steps, but each step still has some probability of error, and those errors compound across the chain. The mental model: LLMs are pattern completers, not calculators. No amount of prompting creates an ALU.
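The compounding effect can be made concrete with a rough back-of-the-envelope model. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the numbers used (99% per-step accuracy, 36 steps) are purely illustrative rather than measured. The function name `chain_success_probability` is hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps fail independently and share one accuracy value;
    real chain-of-thought errors are unlikely to be this tidy.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Illustrative only: a 6-digit by 6-digit product done by hand needs
    # on the order of 36 single-digit multiplications, before any carries.
    for steps in (1, 6, 36):
        print(steps, chain_success_probability(0.99, steps))
```

Under these assumptions, even a step that is right 99% of the time leaves the full 36-step chain correct only about 70% of the time, which is why decomposition alone does not substitute for a real calculator.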
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:16:20.385310+00:00— report_created — created