Report #68509
[counterintuitive] Model makes arithmetic mistakes — need a better chain-of-thought prompt
Always delegate precise arithmetic and numerical computation to code execution tools. Use the LLM to decide WHAT to compute, not to perform the computation itself. For any calculation where an off-by-one error or digit mistake would matter, use a tool.
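One way to read the recommendation above: the model emits an arithmetic expression as text, and a small tool evaluates it exactly. The sketch below is illustrative only. The function name `evaluate_arithmetic` and the whitelist of operators are assumptions made for this example, not part of any particular tool-calling framework.

```python
import ast
import operator

# Operators this hypothetical tool accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def evaluate_arithmetic(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        # Numeric literals only (bool is excluded by the exact type check).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are not arithmetic: refuse them.
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


# The model decides WHAT to compute; the tool performs the computation.
print(evaluate_arithmetic("1847 * 3921"))  # 7242087
```

Walking the syntax tree instead of calling `eval()` keeps the tool limited to arithmetic, so model-generated text cannot trigger arbitrary code.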
Journey Context:
Chain-of-thought (CoT) prompting dramatically improved LLM performance on math benchmarks, creating the impression that arithmetic is a solvable prompting problem. But CoT does not change the fundamental mechanism: LLMs predict likely next tokens; they do not execute algorithms. For small, common arithmetic (7×8=56), the model has seen the pattern often enough in training data to reproduce it reliably. For larger or unusual computations (1847×3921), the model generates plausible-looking digit sequences that are often wrong. The cause is not a shortage of reasoning steps: the model has no carry mechanism, no register, and no computational unit. CoT helps by decomposing problems into smaller sub-computations that are more likely to fall within the model's memorized range, but it is a mitigation, not a cure. The model cannot verify its own computation because it has no way to execute and check; it can only generate another plausible-looking answer. This is why tool-augmented models (with calculators or Python interpreters) dramatically outperform even the best CoT-only approaches on mathematical tasks. The mental model: LLMs are pattern completers, not calculators. Use them for mathematical reasoning (what operations to perform, in what order) but not for mathematical computation (actually performing those operations).
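To make "carry mechanism" concrete, here is a minimal sketch of schoolbook long multiplication, the kind of deterministic procedure an interpreter executes step by step and a next-token predictor only imitates. The function name `long_multiply` is hypothetical; in practice Python's built-in `*` on integers already gives the exact result, and it is used at the end as the check.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication of non-negative ints with explicit carries."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # Store digits least-significant first so index == power of ten.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))

    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10   # digit that stays in this column
            carry = total // 10          # amount carried to the next column
        result[i + len(b_digits)] += carry  # leftover carry for this row

    return int("".join(str(d) for d in reversed(result)))


product = long_multiply(1847, 3921)
print(product)                  # 7242087
print(product == 1847 * 3921)   # True: executing and checking is trivial for code
```

Every column update depends on exact state (`carry`, `result`) carried over from the previous step, which is the part that a single wrong digit in generated text silently breaks.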
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:28:38.576186+00:00— report_created — created