Report #38818
[counterintuitive] Why does the model fail at arithmetic on large numbers even when it correctly explains the mathematical procedure step by step?
Route all numerical computation to a code execution environment, especially arithmetic on numbers with 4+ digits, decimal arithmetic, or anything that requires an exact result. Have the model generate Python or JavaScript that performs the calculation, execute it, and use the result. Never ask the model to compute directly.
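A minimal sketch of the routing idea, assuming the model emits a plain arithmetic expression as text rather than a number. The helper name `evaluate_exact` and the operator whitelist are hypothetical choices for illustration, not part of any particular framework; a real deployment would more likely use a sandboxed interpreter.

```python
import ast
import operator
from fractions import Fraction

# Whitelisted operators (an assumption for this sketch); everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate_exact(expression: str) -> Fraction:
    """Evaluate a model-generated arithmetic expression exactly (hypothetical helper)."""
    source = expression.strip()
    tree = ast.parse(source, mode="eval")

    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError("only numeric literals are allowed")
            if isinstance(value, int):
                return Fraction(value)
            # Re-read decimal literals from the source text so that 0.1 is
            # treated as exactly 1/10 rather than its binary float approximation.
            return Fraction(ast.get_source_segment(source, node))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(tree.body)


if __name__ == "__main__":
    print(evaluate_exact("8274 * 8275"))  # 68467350
    print(evaluate_exact("0.1 + 0.2"))    # 3/10
```

The model's job is reduced to producing the expression; the interpreter produces the digits. Rejecting anything outside the whitelist keeps the evaluator from running arbitrary model output.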
Journey Context:
The counterintuitive gap: a model can explain long multiplication step by step perfectly yet produce wrong answers when it actually computes. This happens because numbers are tokenized inconsistently: '8274' might be a single token while '8275' splits into '82' + '75'. The model doesn't see individual digits; it sees opaque token IDs with no arithmetic relationship to each other. Small arithmetic (single-digit operations and common facts like 7×8=56) works because these are heavily memorized patterns in the training data. Large or novel computations fail because the model is pattern-matching on token-level representations, not executing a digit-by-digit algorithm. Explaining the algorithm only requires reproducing descriptions seen in training data; executing it requires digit-level decomposition that the model does not reliably perform on its tokenized input. This is why a model can write a correct Python function for prime factorization but cannot reliably factor 8274 in its own output. Developers often try to fix this with more chain-of-thought (CoT) steps, but each step carries its own probability of a token-level error, so longer traces accumulate more drift, not less.
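The accumulation argument can be made concrete with a toy model. The sketch below assumes every step in a trace is correct independently with the same probability, which is a simplification, and the accuracy values are illustrative placeholders, not measurements. The function name `trace_success_probability` is hypothetical.

```python
def trace_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a trace is correct.

    Assumes steps fail independently with identical accuracy, so the
    chance of an error-free trace is per_step_accuracy ** steps.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Illustrative accuracy only; real per-step error rates vary by model and task.
    for steps in (1, 10, 50, 200):
        probability = trace_success_probability(0.99, steps)
        print(f"{steps:>3} steps -> {probability:.3f}")
```

Under these assumptions, even 99% per-step accuracy leaves only about a 60% chance of an error-free 50-step trace, which is why adding steps tends to hurt rather than help.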
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:38:00.859137+00:00— report_created — created