Report #52195
[counterintuitive] Model makes basic arithmetic errors that seem fixable with more careful prompting or chain-of-thought
Externalize all precise computation. Use code execution, calculator tools, or Python interpreters for any arithmetic that must be exact—financial calculations, coordinates, statistics, dates. Chain-of-thought improves accuracy on simple problems but does not guarantee correctness and degrades rapidly with operation complexity and number size.
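One way to externalize arithmetic is a small calculator tool that the model calls instead of producing digits itself. The sketch below is illustrative only: the function name `calculate` and the choice of `ast` plus `Decimal` are assumptions, not something this report prescribes. It accepts only numeric literals and the four basic operators, and rejects everything else.

```python
import ast
import operator
from decimal import Decimal

# Whitelisted operators; any other syntax is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> Decimal:
    """Evaluate a plain arithmetic expression with Decimal arithmetic.

    Hypothetical helper for illustration. Results are rounded to the
    default Decimal context (28 significant digits), and very long
    float literals may lose digits when parsed.
    """
    def _eval(node: ast.AST) -> Decimal:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            # Go through repr so the literal 0.1 becomes Decimal("0.1").
            return Decimal(repr(node.value))
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))
```

In this setup the model's job is reduced to writing the expression string (for example `"19.99 * 3"`); the digits of the result come from the runtime, not from token prediction.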
Journey Context:
Developers see arithmetic errors and assume better prompting ('think step by step', 'be careful', 'show your work') will eliminate them. CoT helps on simple problems but hits a hard ceiling because the fundamental issue is architectural: autoregressive next-token prediction is pattern completion, not computation. Each digit is predicted based on learned statistical patterns of what digits typically follow, not calculated via arithmetic operations. Multi-step calculations compound errors because each wrong digit becomes input for the next step. Larger models improve pattern matching but don't add a computational engine. OpenAI's own solution, Code Interpreter, acknowledges this by routing math through a Python runtime rather than attempting it in-language.
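The compounding effect can be made concrete with a toy model. The sketch below assumes every generated digit or step is correct independently with the same probability; that is a simplification introduced here for illustration, and the accuracy figures are hypothetical, not measurements of any model.

```python
def chance_all_correct(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step is right, assuming independent steps.

    Toy model only: real model errors are not independent, and the
    accuracy value is a made-up input, not a benchmark result.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Compare a short calculation with a long one at the same per-step accuracy.
for steps in (5, 20, 100):
    print(steps, round(chance_all_correct(0.99, steps), 3))
```

Under these assumptions, even a high per-step accuracy leaves a long calculation more likely wrong than right, which is consistent with CoT helping on short problems and failing on long ones.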
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:06:14.647015+00:00— report_created — created