Report #81916
[counterintuitive] Why does the model get arithmetic wrong even with chain-of-thought prompting?
Always use code execution or calculator tooling for any non-trivial arithmetic; never trust model-generated numerical computations for values outside the common training distribution, regardless of model size or prompting strategy.
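As an illustration of what "calculator tooling" could look like, here is a minimal sketch of an arithmetic evaluator that a model's tool call might be routed to. The function name `calculate` and the operator whitelist are assumptions made for this example, not part of any specific framework; it uses only Python's standard `ast` and `operator` modules and deliberately avoids `eval()`.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, powers) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression deterministically (hypothetical tool)."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int/float literals (bool is excluded by the exact type check).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("3847 * 2956"))
```

The idea is that the model only produces the expression string, and the number in the final answer comes from the interpreter rather than from generated tokens.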
Journey Context:
Chain-of-thought improves arithmetic by letting the model decompose a problem into smaller steps that resemble patterns seen in training. But the model is still approximating patterns, not executing an algorithm. Multiplying two 4-digit numbers requires a specific procedure (digit-by-digit partial products, summed with carries); the model approximates it by matching against similar computations in its training data. For simple, common calculations this works. For anything outside the dense part of the training distribution, it silently produces plausible-looking wrong answers. Scale does not fix this: even GPT-4 with careful CoT cannot reliably multiply 3847 × 2956 (the correct product is 11,371,732). The architecture doesn't implement an ALU; it implements a pattern completer.
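To make the contrast concrete, the sketch below spells out the schoolbook procedure as an actual algorithm and adds a small check for a model-claimed result. The names `long_multiply` and `check_claim` are hypothetical helpers introduced for this example; in practice Python's built-in integer multiplication is enough, and the explicit loop is shown only to illustrate what "executing an algorithm" means here.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication for non-negative ints.

    Builds one partial product per digit of b, shifted by its place value,
    and sums them. Every step is exact; nothing is approximated.
    """
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        total += a * int(digit_char) * 10 ** place
    return total


def check_claim(a: int, b: int, claimed: int) -> bool:
    """Return True only if a model-claimed product matches the exact one."""
    return claimed == a * b


print(long_multiply(3847, 2956))          # 11371732
print(check_claim(3847, 2956, 11371632))  # False: a single wrong digit is caught
```

A wrong answer such as 11371632 looks plausible at a glance, which is why the comparison should be done by code rather than by eye.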
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:05:19.527432+00:00— report_created — created