Report #88467
[counterintuitive] Models fail at arithmetic because they haven't been trained on enough math—more math data will fix it
Offload all arithmetic, numerical comparison, and mathematical computation to code execution or calculator tools. Never trust model-generated arithmetic regardless of model size or claimed math capabilities, especially for numbers not seen frequently in training.
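A minimal sketch of what offloading can look like, assuming the model emits an arithmetic expression as plain text and a trusted tool evaluates it. The names `evaluate_arithmetic` and `is_greater` are hypothetical helpers written for this illustration, not part of any particular tool-calling framework. The evaluator walks the parsed syntax tree and accepts only numeric literals and the four basic operators, so it never calls `eval` on model output.

```python
import ast
import operator
from decimal import Decimal
from fractions import Fraction

# Whitelisted operators; anything else in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> Fraction:
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if isinstance(node, ast.Constant):
        value = node.value
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Going through str() keeps 0.1 as exactly 1/10.
            return Fraction(str(value))
    elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_arithmetic(expression: str) -> Fraction:
    """Hypothetical tool entry point: exact evaluation of + - * / expressions."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


def is_greater(a: str, b: str) -> bool:
    """Hypothetical helper: compare two decimal strings numerically."""
    return Decimal(a) > Decimal(b)


if __name__ == "__main__":
    print(evaluate_arithmetic("99 + 201"))
    print(evaluate_arithmetic("1234 * 5678"))
    print(is_greater("9.9", "9.11"))
```

Exact rational arithmetic is used here so that results do not pick up floating-point rounding; a production tool might prefer `Decimal` or a sandboxed interpreter, depending on what expressions it must support.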
Journey Context:
The widespread belief is that arithmetic failures are a training data or scale gap that more math examples will close. The deeper issue is that BPE tokenization splits numbers inconsistently: '1234' might tokenize as ['1', '234'] in one context and ['12', '34'] in another. The model therefore has no consistent internal representation of numeric magnitude or place value. A model that correctly adds 100+200 may fail on 99+201 because the token boundaries differ, changing the internal computation path. This is an input representation problem, not a reasoning deficit: no amount of math training data creates a consistent numeric representation from inconsistent token fragments. The model is essentially doing pattern matching on token sequences, not performing arithmetic on numbers.
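The boundary effect can be illustrated with a toy tokenizer. This is a sketch under loud assumptions: `TOY_VOCAB` is an invented vocabulary, and `greedy_tokenize` uses greedy longest-match as a simplified stand-in for real BPE merge rules. It does not reproduce any actual model's tokenizer; it only shows how a fixed vocabulary can split the same digits differently depending on neighboring characters.

```python
# Invented vocabulary for illustration only; real BPE vocabularies are
# learned from data and contain tens of thousands of entries.
TOY_VOCAB = {"100", "200", "99", "20", "12", "34", "234", " 1"}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by always taking the longest vocabulary match.

    Single characters are always allowed as a fallback, so every
    input can be tokenized.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


if __name__ == "__main__":
    # Same digits, different split once a leading space is present.
    print(greedy_tokenize("1234", TOY_VOCAB))    # ['12', '34']
    print(greedy_tokenize(" 1234", TOY_VOCAB))   # [' 1', '234']

    # Two sums with the same result get structurally different inputs.
    print(greedy_tokenize("100+200", TOY_VOCAB))  # ['100', '+', '200']
    print(greedy_tokenize("99+201", TOY_VOCAB))   # ['99', '+', '20', '1']
```

In this toy setup "201" never appears as a single unit, so the place value of its digits has to be inferred from the fragments '20' and '1', while "200" arrives as one token.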
Lifecycle
2026-06-22T07:04:20.844314+00:00— report_created — created