Report #44299
[counterintuitive] Why does the model fail at multi-digit arithmetic even with chain-of-thought?
Offload all non-trivial arithmetic to a code interpreter or calculator tool. Chain-of-thought helps with reasoning structure but cannot compensate for the fact that the model does next-token prediction over tokenized numbers, not symbolic computation. Any arithmetic beyond simple single-digit operations should be executed, not generated.
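As a minimal sketch of what "executed, not generated" can look like: the model emits an expression string, and a small tool computes the result exactly. The function name `evaluate_arithmetic` and the set of allowed operators are assumptions made for illustration, not part of any particular tool-calling API.

```python
import ast
import operator

# Hypothetical tool: the name and the operator whitelist are illustrative choices.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_arithmetic(expression: str) -> int:
    """Exactly evaluate an integer arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only plain integer literals (bool, float, str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Anything else (names, calls, attributes, ...) is refused.
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # The model would supply this string; Python integers are arbitrary precision.
    print(evaluate_arithmetic("3847 * 2956"))  # 11371732
```

The expression has to use Python operators, so `×` must be written as `*`. Walking the parsed syntax tree instead of calling `eval()` keeps model-written text from running arbitrary code.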
Journey Context:
Developers see CoT improve simple arithmetic and assume it scales. But multi-digit multiplication (e.g., 3847 × 2956) requires carry operations and intermediate results that span many tokens, and a single wrong token anywhere in the chain corrupts the final answer. Numbers are tokenized unpredictably: '3847' might be a single token or be split into several pieces (for example '38' and '47'), depending on the tokenizer. The model predicts the next numeric token based on patterns in training data, not by computing. If every generated digit carries some small chance of being wrong, the chance that the whole calculation is right shrinks roughly geometrically with the number of digits and intermediate steps. The GPT-4 technical report explicitly identifies arithmetic as a persistent weakness despite massive scale; bigger models improve this marginally but don't solve it because the architecture does pattern completion, not computation. The counterintuitive insight: a model that can explain calculus cannot reliably multiply four-digit numbers, because explanation is pattern matching on mathematical language while multiplication requires exact symbolic execution.
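The compounding effect can be illustrated with a back-of-the-envelope model. This is a sketch under strong assumptions: each generated numeric token is treated as independently correct with the same probability, and the accuracy figure and token counts below are made-up illustrative values, not measurements of any model.

```python
def probability_all_correct(per_token_accuracy: float, token_count: int) -> float:
    """Chance that every token in a calculation is right, assuming independence."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be between 0 and 1")
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    # Independent events multiply, so success decays geometrically with length.
    return per_token_accuracy ** token_count


if __name__ == "__main__":
    # Illustrative only: 99% per-token accuracy over increasingly long workings.
    for tokens in (5, 30, 100):
        print(tokens, round(probability_all_correct(0.99, tokens), 3))
```

Under these assumptions, 99% per-token accuracy leaves only about a 74% chance that a 30-token calculation is entirely correct. Real errors are unlikely to be independent, so this shows the direction of the effect, not a prediction.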
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:49:28.886657+00:00— report_created — created