Report #52907
[counterintuitive] Better prompting will make the model do accurate multi-digit arithmetic
Route all numerical computation to code execution or calculator tools. The model's direct arithmetic output is unreliable for anything beyond simple single-digit operations regardless of prompting strategy.
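One way to follow this recommendation is to expose a small calculator function as a tool and let the model emit the expression instead of the answer. The sketch below is illustrative only: the name `calculate` and the set of allowed operators are assumptions, not part of any particular agent framework. It walks the parsed expression with Python's `ast` module rather than calling `eval()`, and it deliberately leaves out exponentiation to avoid runaway computations.

```python
import ast
import operator

# Hypothetical calculator tool; names are illustrative, not from any framework.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int/float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are refused.
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


if __name__ == "__main__":
    print(calculate("3847 + 2916"))
```

The model's job then shrinks to producing the string `"3847 + 2916"`; the digits of the result come from the interpreter, not from next-token prediction.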
Journey Context:
Numbers are tokenized in unpredictable chunks: '3847' might become ['38', '47'] or ['3', '847'] depending on the tokenizer. The model doesn't perceive digit positions in a place-value system; it sees opaque token IDs. When it adds 3847 + 2916, it isn't performing column addition; it's predicting the most likely next token given patterns in training data. Chain-of-thought helps slightly by decomposing the problem into smaller operations the model has memorized, but the decomposition itself requires correct digit-level perception, which tokenization corrupts. Even step by step, carry errors accumulate. This is why agent frameworks commonly include calculator tools: it's an acknowledged architectural limitation, not a prompt engineering opportunity.
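To make the contrast concrete, here is a minimal sketch of the two ideas in this paragraph. `chunkings` is a hypothetical helper that enumerates possible splits of a digit string; it is not a real tokenizer and only illustrates that chunk boundaries need not line up with place value. `column_add` is the explicit right-to-left procedure with carries, which is the kind of deterministic computation a tool performs and which next-token prediction does not.

```python
def chunkings(digits: str, max_len: int = 3):
    """List every way to split a digit string into chunks of 1..max_len chars.

    Illustration only: a real tokenizer picks one split, chosen by its vocabulary.
    """
    if not digits:
        return [[]]
    results = []
    for size in range(1, min(max_len, len(digits)) + 1):
        head, tail = digits[:size], digits[size:]
        results.extend([head] + rest for rest in chunkings(tail, max_len))
    return results


def column_add(a: str, b: str):
    """Add two non-negative decimal strings right to left, recording each carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry, out, carries = 0, [], []
    for da, db in zip(reversed(a), reversed(b)):
        # divmod yields (carry into the next column, digit for this column).
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        out.append(str(digit))
        carries.append(carry)
    if carry:
        out.append(str(carry))
    return "".join(reversed(out)), carries


if __name__ == "__main__":
    print(chunkings("3847"))           # includes ['38', '47'] and ['3', '847']
    print(column_add("3847", "2916"))  # sum plus the carry out of each column
```

For 3847 + 2916, the ones and hundreds columns each produce a carry. Those are the steps where a chunk boundary such as ['38', '47'] cuts across the digits that have to interact.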
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:18:08.826026+00:00— report_created — created