Report #62213
[counterintuitive] Model makes arithmetic errors on large or uncommon numbers despite being told to calculate carefully
Offload all numerical computation to code execution tools or calculator functions; never trust LLM output for arithmetic beyond simple memorized facts (single-digit operations, common constants).
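One way to follow this advice is to give the model a calculator function it can call instead of producing digits itself. The sketch below is a minimal illustration, not a reference implementation: the name `calculate` and the choice of supported operators are assumptions made for this example. It parses the expression with Python's standard `ast` module and evaluates only whitelisted arithmetic nodes, so model-supplied text never reaches `eval()`.

```python
import ast
import operator

# Whitelisted operators; anything outside this set is rejected.
# Exponentiation is deliberately left out to avoid runaway computations.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression (hypothetical tool function)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int and float literals (bool is excluded on purpose).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attributes, etc. are never evaluated.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    print(calculate("847291 + 293847"))  # 1141138
```

Python integers are arbitrary precision, so large integer sums and products stay exact here. Results involving `/` are floats and carry the usual floating-point rounding.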
Journey Context:
Developers see a model correctly answer '2+2=4' and assume it can do arithmetic, then are surprised when it fails on '847291 + 293847'. The model has not learned an arithmetic algorithm; it has memorized common arithmetic patterns from training data. For uncommon or large-number arithmetic, there is no pattern to match. Tokenization compounds this: '847291' may be tokenized as ['847', '291'], destroying the digit alignment needed for column arithmetic. Chain-of-thought helps by distributing computation across more tokens, but each step is still next-token prediction over tokenized digits, not symbolic computation. This is not a training gap that more data fixes; it is a fundamental mismatch between autoregressive text prediction and algorithmic computation.
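The alignment problem can be illustrated with a toy stand-in for a tokenizer. The sketch below assumes a hypothetical scheme that splits digit strings left to right into fixed three-character chunks. Real tokenizers use learned vocabularies and will split differently, so treat this only as an illustration of the mechanism. The helper names `chunk_digits` and `place_values` are invented for this example.

```python
def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string left to right into fixed-size chunks (toy tokenizer)."""
    return [number[i:i + size] for i in range(0, len(number), size)]


def place_values(number: str, size: int = 3) -> list[tuple[str, int]]:
    """Pair each chunk with the power of ten of its last digit."""
    result = []
    remaining = len(number)
    for chunk in chunk_digits(number, size):
        remaining -= len(chunk)
        result.append((chunk, remaining))
    return result


if __name__ == "__main__":
    print(place_values("847291"))   # [('847', 3), ('291', 0)]
    print(place_values("1847291"))  # [('184', 4), ('729', 1), ('1', 0)]
```

Under this assumed scheme, prepending a single digit moves every chunk boundary: the digits of '847291' land in different chunks with different place values, so chunks at the same position in two operands need not line up by column.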
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:54:31.255778+00:00— report_created — created