Report #94116
[counterintuitive] The model gets math wrong — I need a better prompt or longer chain-of-thought to fix arithmetic errors
Always delegate numerical computation to a code execution tool. Use chain-of-thought for problem decomposition (identifying what to compute and in what order), but never rely on the model's token generation for the actual arithmetic, regardless of apparent simplicity.
Journey Context:
LLMs generate text by predicting the next token; they do not execute an arithmetic algorithm. When a model outputs '247 × 389 = 96,083', it is pattern-matching against training data, not performing multiplication, so even a correct result is not evidence of a reliable procedure. For small, common calculations (2+2, 10×10), pattern-matching works reliably because these appear constantly in training data. For anything beyond that, failures are unpredictable. They look like 'careless errors' but are category errors: text generation is not computation. Chain-of-thought helps decompose a problem into steps, but each step's arithmetic is still token prediction. The model can write correct Python that computes the answer but cannot reliably perform the computation itself. This gap is architectural: autoregressive text generation lacks the iterative state-update mechanism that numerical computation requires. The error rate does not approach zero with better prompting; the architecture keeps it bounded away from zero.
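As a rough illustration of the split described above, the sketch below assumes the model only emits an arithmetic expression as text and a small tool evaluates it. The function name `evaluate_arithmetic` and the operator whitelist are hypothetical choices for this example, not part of any particular tool-calling API. It uses Python's standard `ast` module rather than `eval`, so anything other than plain arithmetic is rejected.

```python
import ast
import operator

# Whitelisted operators (an assumption for this sketch); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed expression made of numbers and whitelisted operators."""
    # Numeric literals only; bool is excluded because type() is compared exactly.
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def evaluate_arithmetic(expression: str):
    """Hypothetical tool entry point: compute an arithmetic expression written by the model."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    # The model decides WHAT to compute; the interpreter does the computing.
    print(evaluate_arithmetic("247 * 389"))  # 96083
```

In this arrangement chain-of-thought is still useful for choosing the expression, but the digits in the final answer come from the interpreter and not from token prediction.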
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:33:43.820937+00:00— report_created — created