Report #54214
[counterintuitive] Model fails at arithmetic — needs better prompting or a larger model
Offload all non-trivial arithmetic to code execution or calculator tools. The model's number tokenization destroys place-value structure, which makes multi-digit arithmetic unreliable regardless of model scale or prompt sophistication.
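As a minimal sketch of what offloading could look like, the function below evaluates a plain arithmetic expression exactly, outside the model. The name `calculate` and the set of allowed operators are assumptions made for illustration, not part of any particular tool-calling API; wiring it into a model's tool interface is left out.

```python
import ast
import operator

# Operators this hypothetical tool accepts. Exponentiation is left out on
# purpose so a stray expression cannot request an enormous result.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""

    def ev(node):
        # Numeric literals only (bool is excluded by the exact type check).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](ev(node.operand))
        # Names, calls, attributes, etc. are rejected.
        raise ValueError("unsupported expression")

    return ev(ast.parse(expression, mode="eval").body)


if __name__ == "__main__":
    # The model emits the expression as text; the tool computes the value.
    print(calculate("123456789 * 987654321"))
```

Python integers are arbitrary precision, so integer results are exact whatever the number of digits; division with `/` returns a float and carries the usual floating-point caveats.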
Journey Context:
BPE tokenizes numbers inconsistently: '1234' might become the tokens ['12', '34'] while '1235' might become ['1', '235']. The same digit position can fall in different tokens depending on the full number and its surrounding context. As a result, the model cannot learn a consistent digit-by-digit algorithm (such as carrying in addition), because the digit boundaries shift unpredictably from one number to the next. A model might learn that the last token of a number often corresponds to the ones place, but this heuristic breaks for numbers where the token boundary falls differently. Research reports that models with character-level or digit-level tokenization perform dramatically better on arithmetic, which points to representation, not model capacity, as the bottleneck. Scaling up a model with BPE number tokenization is like giving someone a bigger calculator whose keypad layout changes randomly: more compute doesn't fix the input encoding problem.
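The boundary shift described above can be reproduced with a toy BPE merge loop. The merge table here is hypothetical, chosen only so that the two splits from the example come out; it is not taken from any real tokenizer's vocabulary, and real merge tables are far larger.

```python
# Hypothetical merge rules in priority order (earlier rules win).
# An assumption for illustration, not any real tokenizer's table.
MERGES = [("3", "5"), ("2", "35"), ("1", "2"), ("3", "4")]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def toy_bpe(text):
    """Split text into characters, then repeatedly apply the best merge."""
    tokens = list(text)
    while True:
        # Collect every adjacent pair that has a merge rule, with its rank.
        candidates = [
            (RANK[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in RANK
        ]
        if not candidates:
            return tokens
        # Apply the highest-priority (lowest-rank) merge, leftmost on ties.
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


if __name__ == "__main__":
    for number in ("1234", "1235"):
        print(number, toy_bpe(number))
```

With this table, the hundreds digit '2' lands in the first token for 1234 and in the second token for 1235, even though the two numbers differ only in the ones digit.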
Lifecycle
2026-06-19T21:29:46.522017+00:00— report_created — created