Report #64262
[counterintuitive] Model makes unpredictable arithmetic errors that seem fixable with better prompting or chain-of-thought
Delegate all non-trivial arithmetic to code execution or calculator tools; never trust the model's direct numerical computation regardless of chain-of-thought prompting or model size.
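As a minimal sketch of what delegating arithmetic to a calculator tool could look like, the Python below evaluates a plain arithmetic expression outside the model. The function name `calculate` and the operator whitelist are assumptions made for illustration, not part of this report or of any particular tool-calling API. The model would emit the expression as a string, and the host application would compute the result.

```python
import ast
import operator

# Whitelisted operators (an illustrative choice). Exponentiation is left out
# on purpose so an expression like 9**9**9 cannot exhaust resources.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Hypothetical calculator tool: evaluate a plain arithmetic expression.

    Only numeric literals and the whitelisted operators are accepted;
    anything else (names, calls, attribute access) raises ValueError.
    """

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    # The model supplies the expression text; the arithmetic happens here.
    print(calculate("998 + 3"))
    print(calculate("3847 * 3848"))
```

Parsing with `ast` rather than calling `eval` keeps the tool limited to arithmetic, which matters when the expression text comes from model output.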
Journey Context:
The common belief is that arithmetic errors are a reasoning gap that chain-of-thought or scale will close. The underlying problem is tokenization: numbers are split into tokens in ways unrelated to their mathematical structure. For example, '3847' might tokenize as ['38', '47'] while '3848' is a single token. The model learns statistical patterns over token sequences, not arithmetic operations. Errors are therefore unpredictable and input-dependent: the same model might correctly compute 999+1 but fail on 998+3, depending on how the specific numbers tokenize. Chain-of-thought helps with simple operations by decomposing them into memorized patterns, but it doesn't create a genuine arithmetic unit. This is why arithmetic reliability doesn't improve smoothly with scale: it depends on how well tokenization aligns with the specific numbers involved.
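The split described above can be illustrated with a toy example. The sketch below is not a real tokenizer: `greedy_tokenize` uses simple longest-match lookup, and `TOY_VOCAB` is a made-up vocabulary chosen only to reproduce the '3847' versus '3848' case. Production tokenizers (for example BPE-based ones) use different algorithms and vocabularies, so actual splits will differ.

```python
def greedy_tokenize(text, vocab):
    """Split text by greedy longest-match against vocab (toy stand-in for a real tokenizer)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first; a single character
        # is always accepted as a fallback so the loop makes progress.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary: '3848' happens to be one entry, '3847' does not.
TOY_VOCAB = {"38", "47", "3848"}

if __name__ == "__main__":
    for number in ("3847", "3848"):
        print(number, greedy_tokenize(number, TOY_VOCAB))
```

Two numbers that differ by one end up with different token counts and boundaries, so nothing in the token sequence reflects their numerical closeness.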
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:20:58.660227+00:00— report_created — created