Report #94544
[counterintuitive] Model makes basic arithmetic errors on large numbers despite handling small-number arithmetic correctly
Route all numerical computation to a code interpreter, calculator tool, or symbolic math engine. Never rely on the model's direct text generation for arithmetic, regardless of number size, prompt sophistication, or chain-of-thought length.
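One way to follow this recommendation is to have the model emit the expression as text and let ordinary code compute the result. The sketch below is a minimal illustration, not a reference implementation: the function name `evaluate_arithmetic` is hypothetical, and it assumes the tool only needs integer literals with `+`, `-`, `*`, `/` and parentheses. It walks Python's `ast` rather than calling `eval`, so anything other than plain arithmetic is rejected.

```python
import ast
import operator
from fractions import Fraction

# Whitelisted operators; any other syntax is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    # Integer literals only (an assumption of this sketch); Fraction keeps division exact.
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def evaluate_arithmetic(expression: str) -> Fraction:
    """Hypothetical calculator tool: compute an expression the model wrote as text."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


if __name__ == "__main__":
    print(evaluate_arithmetic("7384 + 5921"))
```

The model's job is then limited to producing the string `"7384 + 5921"`; the digits of the answer never pass through text generation.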
Journey Context:
Developers observe that models handle 7+5=12 correctly but fail on 7384+5921, and assume this is a reasoning gap that better prompting or chain-of-thought can close. The root cause is tokenization: multi-digit numbers are split into arbitrary token chunks (e.g., '7384' might become tokens '73' and '84', or '738' and '4'), and the model has no reliable mechanism to map these chunks back to place values for carry operations. The model learns surface statistical patterns for common arithmetic (memorizing that 7+5=12) but cannot dependably execute the algorithmic steps that multi-digit arithmetic requires. Chain-of-thought sometimes appears to help by breaking a problem into smaller sums the model has memorized, but it remains unreliable because the decomposition itself requires the place-value understanding that tokenization obscures. Scaling model size improves memorized coverage but doesn't install an arithmetic algorithm. The GPT-4 technical report itself documents these limitations, and the existence of Code Interpreter as a product feature is the vendor's own acknowledgment that text-generation-path arithmetic is unreliable.
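To make the place-value point concrete, the sketch below shows the schoolbook carry algorithm that multi-digit addition needs, next to a helper that reports which power of ten each chunk of a split number starts at. Both function names are hypothetical, and the chunk splits are the illustrative ones from the paragraph above, not the output of any real tokenizer.

```python
def add_with_carries(a: str, b: str) -> str:
    """Schoolbook addition on decimal strings: align digits by place, propagate carries."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    # Walk from the ones place upward, one aligned digit pair at a time.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


def chunk_place_values(chunks):
    """For each chunk, return (chunk, exponent of ten that its last digit occupies)."""
    remaining = sum(len(chunk) for chunk in chunks)
    weights = []
    for chunk in chunks:
        remaining -= len(chunk)
        weights.append((chunk, remaining))
    return weights


if __name__ == "__main__":
    print(add_with_carries("7384", "5921"))
    # Two hypothetical splits of the same number put different place weights on each chunk.
    print(chunk_place_values(["73", "84"]))
    print(chunk_place_values(["738", "4"]))
```

The carry loop only works because every digit pair is aligned by place. With the two splits shown, the first chunk ends at the hundreds place in one case and the tens place in the other, which is the alignment information the report argues is obscured on the text-generation path.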
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:16:25.126166+00:00— report_created — created