Report #47178
[counterintuitive] The model just needs better prompting or chain-of-thought to do arithmetic correctly
Use tool calls or code execution for any arithmetic beyond simple single-digit operations. Chain-of-thought helps decide WHAT to compute, but the LLM itself should never BE the calculator.
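As a rough illustration of the tool-call approach, the sketch below shows a small calculator the host application could expose to the model. The function name `calculate` and the choice of supported operators are assumptions made for this example, not part of any particular tool-calling API. The model would emit only the expression string; the host does the arithmetic.

```python
import ast
import operator

# Hypothetical calculator tool: the model supplies the expression text,
# the host evaluates it. Only plain arithmetic is allowed; no eval().
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate an arithmetic expression given as text."""
    tree = ast.parse(expression, mode="eval")
    return _evaluate(tree.body)


def _evaluate(node):
    # Numeric literals only (bool is rejected because its type is not int/float).
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    # Names, calls, attribute access, etc. are refused.
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


print(calculate("247 * 389"))  # 96083
```

Exponentiation is deliberately left out of this sketch so that a model-supplied expression cannot request an enormous power; add it only with a bound on the operands.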
Journey Context:
Developers see a model fail at multiplication and assume it is a reasoning problem. They add chain-of-thought prompting, which sometimes appears to help, but the improvement is illusory for the computation step itself. LLMs are next-token predictors: when they output '247 × 389 = 96,083', they are pattern-matching, not computing. For numbers that are common in training data (small numbers, round numbers), the pattern is reliable. For arbitrary numbers, accuracy drops sharply. Chain-of-thought decomposes the problem into steps, which helps the model decide that a multiplication is needed, but each multiplication step still relies on token prediction, not arithmetic. Doing the computation inside the model would require an architectural change, such as a differentiable calculator or a neurosymbolic module. No amount of prompt refinement turns a next-token predictor into an ALU.
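Because a model-written product can look plausible whether or not it is right, one way to catch errors is to recompute every stated product outside the model. The sketch below is a minimal example of that idea; the function name `check_products` and the claim format it recognizes (two integers, a multiplication sign, an equals sign, a result) are assumptions chosen for illustration.

```python
import re

# Matches claims such as "247 × 389 = 96,083"; "x" and "*" are also accepted.
# Numbers may use comma thousands separators.
_NUMBER = r"\d+(?:,\d{3})*"
_CLAIM = re.compile(rf"({_NUMBER})\s*[×x*]\s*({_NUMBER})\s*=\s*({_NUMBER})")


def _to_int(text: str) -> int:
    return int(text.replace(",", ""))


def check_products(model_output: str):
    """Return (claim_text, claimed, actual, is_correct) for each product claim."""
    results = []
    for match in _CLAIM.finditer(model_output):
        left, right, claimed = (_to_int(group) for group in match.groups())
        actual = left * right  # exact integer arithmetic done by the host
        results.append((match.group(0), claimed, actual, claimed == actual))
    return results


for row in check_products("First, 247 × 389 = 96,083. Then 12 * 12 = 146."):
    print(row)
```

A check like this only detects wrong answers after the fact; routing the computation through a tool avoids producing them in the first place.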
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:39:38.127236+00:00— report_created — created