Report #40096
[counterintuitive] Model can't do reliable arithmetic — it just needs a bigger model or more training data
Use code execution or calculator tools for any arithmetic beyond simple single-digit operations. Do not rely on the model's direct arithmetic output for multi-digit multiplication, division, or any computation where precision matters, regardless of model size.
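One way to follow this advice is to hand arithmetic to a small calculator tool instead of trusting the model's own digits. The sketch below is only an illustration, not an implementation the report prescribes: the function name `calculate` and the set of supported operators are assumptions, and it relies on Python's standard `ast` and `fractions` modules so that integer and rational results stay exact.

```python
import ast
import operator
from fractions import Fraction

# Operators the hypothetical tool accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: lambda a, b: Fraction(a, b),  # exact rational division
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate an integer arithmetic expression exactly, without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain integer literals only (bool and float are rejected).
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


# The model emits the expression as text; the tool returns the exact value.
print(calculate("84729 * 39104"))
```

The model's job shrinks to producing the expression string, while the host process does the computing. Walking the parsed syntax tree instead of calling `eval()` keeps the tool from executing anything other than arithmetic.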
Journey Context:
The widespread belief is that arithmetic errors are a training gap that scale will close: that GPT-5, or a model trained on more math data, will reliably compute 84729 × 39104. This is partially wrong in an important way. Larger models do improve on simple arithmetic, but they hit a ceiling on multi-digit operations because the model does not perform digit-by-digit computation; it pattern-matches against training data. The model learns 7×8=56 as a lookup, not as a procedure. For numbers outside its training distribution (large, unusual combinations) it must generalize, and that generalization is unreliable because numbers are tokenized inconsistently: '847' might be one token and '29' another, and the model has no reliable mechanism for aligning digit positions across tokens. This is why a model might correctly compute 23×47 but fail on 2347×8192: the algorithm is the same, but the token boundaries no longer line up with digit positions. Mitigations exist (e.g., giving the model a scratchpad to write out the digit-by-digit computation), but they require the model to learn and reliably execute a multi-step algorithm, which remains fragile. The robust solution is external computation.
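The digit-alignment problem can be made concrete with a toy model of tokenization. The sketch below assumes a hypothetical tokenizer that splits digit strings left to right into chunks of up to three digits; real tokenizers split differently, so `chunk_digits` and `chunk_place_values` are illustrative names only. It shows the same chunk, '847', landing on different powers of ten depending on how many digits follow it.

```python
def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Split a digit string left to right into chunks of at most `size` digits.

    Hypothetical stand-in for a tokenizer; real tokenizers split differently.
    """
    return [number[i:i + size] for i in range(0, len(number), size)]


def chunk_place_values(number: str, size: int = 3) -> list[tuple[str, int]]:
    """Pair each chunk with the power of ten of its last digit."""
    pairs = []
    digits_to_the_right = len(number)
    for chunk in chunk_digits(number, size):
        digits_to_the_right -= len(chunk)
        pairs.append((chunk, digits_to_the_right))
    return pairs


# '847' is followed by two digits in the first number and three in the second,
# so the identical chunk stands for a different magnitude in each.
for n in ("84729", "847291"):
    print(n, chunk_place_values(n))
```

Under this assumed splitting, the chunk alone says nothing about its place value; that has to be inferred from everything after it, which is the alignment step the paragraph above describes as unreliable.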
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:46:28.420171+00:00— report_created — created