Report #86099
[counterintuitive] The model should be able to do precise arithmetic if prompted correctly
Use code execution (a Python interpreter or a calculator tool) for any arithmetic beyond simple single-digit operations. Never trust model-generated arithmetic for precision-critical tasks. Even for 'simple' multi-digit arithmetic, have the model write and execute code rather than compute the result in its head.
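One way to provide such a calculator tool is sketched below. This is a minimal illustration, not the API of any particular agent framework: the function name `calculate` and the choice to support only exact integer operators are assumptions made for this example. The model would emit an expression string, and the host application would evaluate it instead of accepting a number the model produced itself.

```python
import ast
import operator

# Hypothetical calculator tool; the names here are illustrative only.
# Only exact integer operators are allowed, so results never pass
# through floating point. Exponentiation is left out to avoid
# accidentally huge computations.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def _eval_node(node):
    """Recursively evaluate a parsed expression made of integer literals."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_node(node.operand)
    # Names, calls, attribute access and everything else are rejected.
    raise ValueError("unsupported expression")


def calculate(expression: str) -> int:
    """Evaluate an integer arithmetic expression without using eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    print(calculate("247 * 389"))
    print(calculate("247 * 398"))
```

Walking the syntax tree rather than calling `eval()` keeps the tool limited to arithmetic, which matters when the expression text comes from a model.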
Journey Context:
Developers often assume that arithmetic is a reasoning task that better prompts can solve. The underlying issue is how numbers are tokenized: BPE tokenization splits multi-digit numbers unpredictably. '1234' might be a single token, while '5678' might be split into ['56', '78']. The model has no built-in positional number system; it sees opaque token IDs, not digits with place values. As a result, it cannot reliably perform carries, column addition, or any other arithmetic that requires digit-level manipulation. Larger models pattern-match common calculations better but do not generalize to arbitrary precision. A model might compute 247 * 389 correctly because that product appeared in its training data, yet fail on 247 * 398. This is the gap that code interpreter tools fill: the model writes the code, and a real interpreter runs it.
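The place-value problem can be made concrete with a toy sketch. The chunk splits below are hypothetical, taken from the example in this report rather than from a real tokenizer, and the helper names are invented for illustration. The point is that the power of ten a chunk contributes depends on the lengths of the chunks that follow it, which is information an isolated token ID does not carry.

```python
def chunk_offsets(chunks):
    """Pair each chunk with the power of ten of its last digit.

    The offset depends on how many digits come after the chunk, so it
    cannot be read off the chunk itself.
    """
    total_digits = sum(len(chunk) for chunk in chunks)
    pairs = []
    consumed = 0
    for chunk in chunks:
        consumed += len(chunk)
        pairs.append((chunk, total_digits - consumed))
    return pairs


def value_from_chunks(chunks):
    """Rebuild the integer from its chunks using explicit place values."""
    return sum(int(chunk) * 10 ** power for chunk, power in chunk_offsets(chunks))


if __name__ == "__main__":
    # Hypothetical splits of the same kind as the report's example.
    for split in (["1234"], ["56", "78"], ["5", "678"]):
        print(split, chunk_offsets(split), value_from_chunks(split))
```

Here `['56', '78']` and `['5', '678']` both stand for 5678, yet no chunk occupies the same place value in both splits. An interpreter sidesteps this entirely because it operates on the integer itself.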
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:06:30.193773+00:00— report_created — created