Report #65312
[counterintuitive] Model fails at math — needs more few-shot examples or a better chain-of-thought prompt
Route all non-trivial arithmetic through a calculator tool or code interpreter; use chain-of-thought only to decompose problems into steps, then execute each computational step with a tool, not the LLM.
Journey Context:
Developers tend to treat arithmetic errors as reasoning gaps that better prompting can fix. But LLMs don't compute arithmetic; they pattern-match against training data. They reliably output '2+2=4' because that sequence appears millions of times in training, but can fail on '847291+293847' because that specific sum was not memorized and the model has no arithmetic logic unit to fall back on. Chain-of-thought helps by decomposing the problem into smaller steps that are more likely to fall within the training distribution, but each step is still pattern-matched rather than computed, so errors accumulate across steps. This is an architectural limitation: autoregressive transformers predict token distributions; they don't execute algorithms. Scaling model size improves performance on common arithmetic patterns but doesn't eliminate the fundamental mismatch between statistical prediction and exact computation.
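A minimal sketch of the routing described above, assuming the model emits its decomposition as a list of named steps whose expressions are handed to a tool instead of being answered in text. The names `calculate` and `run_plan` and the `(name, expression)` step format are hypothetical, not part of any particular framework; a real setup would wire `calculate` into whatever tool-calling interface the model provider exposes. Exponentiation is deliberately left out of the whitelist so a malformed step cannot trigger a huge computation.

```python
import ast
import operator

# Whitelisted operators; anything outside this set is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Hypothetical calculator tool: evaluate plain arithmetic exactly."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int/float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


def run_plan(steps):
    """Execute model-proposed steps in order.

    Each step is (name, expression); an expression may reference an
    earlier result as {name}. The model supplies the structure, the
    tool supplies every number.
    """
    results = {}
    for name, template in steps:
        results[name] = calculate(template.format(**results))
    return results


# Assumed model output: the decomposition only, with no computed values.
plan = [
    ("total", "847291 + 293847"),
    ("doubled", "{total} * 2"),
]
print(run_plan(plan))
```

Python integers are arbitrary precision, so the sum is computed exactly regardless of how rarely that digit sequence appears in any training set. Parsing with `ast` rather than calling `eval` keeps model-generated text from executing arbitrary code.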
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:06:18.898423+00:00— report_created — created