Report #80634
[counterintuitive] Why does the model fail at arithmetic even with chain-of-thought prompting?
Use code execution for any arithmetic beyond simple single-digit operations. Chain-of-thought helps with reasoning decomposition but cannot compensate for the fact that the model performs pattern matching on token fragments, not mathematical computation on numbers.
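One way to externalize the arithmetic is a small evaluator that the model calls as a tool. The sketch below is an assumption for illustration, not part of this report: the name `evaluate_arithmetic` and the operator whitelist are hypothetical. It parses the expression with Python's standard `ast` module instead of calling `eval()`, so only plain numeric arithmetic is accepted.

```python
import ast
import operator

# Whitelisted operators (an illustrative choice); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_arithmetic(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int and float literals (bool is deliberately excluded).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # The model emits the expression; the interpreter does the computing.
    print(evaluate_arithmetic("1234 * 5678"))
```

In this setup the model's job is reduced to producing the expression string, which is a reasoning task, while the interpreter produces the number.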
Journey Context:
Developers try to fix arithmetic errors with longer chain-of-thought, more worked examples, or 'think step by step' prompts. The underlying issue is twofold. First, BPE tokenization splits numbers unpredictably: '1234' might be a single token while '5678' might tokenize as ['56', '78']. Digit-by-digit operations need consistent boundaries, and the model cannot reliably decompose a number into its digits when those digits do not align with token boundaries. Second, the model performs statistical pattern matching, not computation. It has learned surface-level regularities about how numbers relate in its training data, not arithmetic algorithms. CoT can help the model break a problem into steps it is more likely to have seen in training, but it does not turn those steps into actual computation. For reliable arithmetic, the model must write and execute code, externalizing the computation that its architecture does not perform internally.
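The token-boundary problem can be illustrated with a toy tokenizer. Everything in this sketch is an assumption: `TOY_VOCAB` is a made-up vocabulary, and greedy longest-match is a simplified stand-in for real BPE merging, so the splits shown are not those of any actual model.

```python
# Hypothetical vocabulary: NOT taken from any real tokenizer.
TOY_VOCAB = {"1234", "56", "78"} | set("0123456789")


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split; a simplified stand-in for BPE."""
    tokens = []
    longest = max(len(token) for token in TOY_VOCAB)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


if __name__ == "__main__":
    for number in ("1234", "5678", "12345678"):
        print(number, "->", toy_tokenize(number))
    # In code, by contrast, digits are always directly addressable.
    print([int(d) for d in "5678"])
```

Under this toy vocabulary, two four-digit numbers reach the model as one chunk and two chunks respectively, so there is no fixed position where "the tens digit" lives. Code operating on the same string sees every digit individually.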
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:56:53.659452+00:00— report_created — created