Report #58231
[counterintuitive] Model makes arithmetic errors on large numbers despite chain-of-thought prompting
Always delegate arithmetic to a code interpreter or calculator tool; chain-of-thought improves multi-step reasoning decomposition but does not fix the tokenization-induced fragility of number representation
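One way to sketch the delegation recommended above is a small calculator tool that a harness exposes to the model, so the model only has to emit the expression and never the result. This is a minimal illustration under assumptions: the function name `calculate` and the choice of supported operators are hypothetical and do not come from any particular tool-calling framework.

```python
import ast
import operator

# Hypothetical helper: a minimal integer-arithmetic "tool" for a model harness.
# Only these operators are allowed; everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression):
    """Evaluate an integer arithmetic expression exactly, without eval()."""

    def _eval(node):
        # Plain integer literal (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression element: " + ast.dump(node))

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


if __name__ == "__main__":
    print(calculate("1234 + 5678"))  # 6912
    print(calculate("123456789 * 987654321"))
```

Python integers have arbitrary precision, so the result is exact regardless of operand size. Walking the parsed syntax tree instead of calling `eval()` keeps model-supplied text from running arbitrary code; exponentiation is left out here so a short expression cannot trigger a very large computation.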
Journey Context:
The common belief is that chain-of-thought (CoT) prompting fixes math errors. CoT does help decompose a problem into steps, but it does not fix the underlying number representation problem. BPE tokenizes numbers inconsistently: '1234' might be one token or two ('12' + '34'), and the model has no reliable way to determine digit boundaries within a token. As a result, even a single addition or multiplication step can be wrong for large numbers, and those errors compound through a chain of thought. The model is not computing 1234 + 5678; it is pattern-matching what the answer should look like based on training data, which works for small, common numbers but degrades on large or unusual ones. Tool use (code execution) is the only reliable fix for arithmetic.
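The inconsistent chunking described above can be illustrated with a toy tokenizer. This is a simplified sketch, not real BPE: it uses greedy longest-match over a made-up vocabulary (`TOY_VOCAB` and `toy_tokenize` are hypothetical names), whereas real BPE applies learned merge rules and vocabularies differ by model. It only shows how the same digits can land in different chunks depending on what surrounds them.

```python
# Made-up vocabulary for illustration only; not any real model's token list.
TOY_VOCAB = {"1234", "123", "12", "34", "56", "78", "567"} | set("0123456789")


def toy_tokenize(digits, vocab=TOY_VOCAB):
    """Split a digit string by greedy longest match against vocab."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(digits):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(digits) - i), 0, -1):
            piece = digits[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError("no token covers " + repr(digits[i]))
    return tokens


if __name__ == "__main__":
    for number in ("1234", "12345", "3412", "5678"):
        print(number, toy_tokenize(number))
```

With this toy vocabulary, '1234' is a single chunk, '3412' splits into two two-digit chunks, and '5678' splits into a three-digit chunk plus a single digit, so chunk boundaries do not line up with place value.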
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:13:57.379496+00:00— report_created — created