Report #26783
[counterintuitive] Model produces wrong arithmetic or numerical results
Delegate ALL numerical computation to code execution. This includes array indexing, offset calculations, date arithmetic, floating-point operations, and any math where exactness matters. No amount of chain-of-thought reasoning makes an LLM a reliable calculator.
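As a minimal sketch of what delegating these categories to code can look like, the helpers below cover date arithmetic, exact decimal sums, and an offset calculation using only the Python standard library. The function names and the row-major buffer layout are illustrative assumptions, not part of this report.

```python
from datetime import date, timedelta
from decimal import Decimal
from typing import Iterable


def days_between(start: date, end: date) -> int:
    """Whole days from start to end; the calendar handles leap years."""
    return (end - start).days


def add_days(start: date, days: int) -> date:
    """Date arithmetic that rolls over months and years correctly."""
    return start + timedelta(days=days)


def exact_sum(amounts: Iterable[str]) -> Decimal:
    """Sum decimal strings exactly, avoiding binary floating-point error."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))


def byte_offset(row: int, col: int, row_stride: int, elem_size: int) -> int:
    """Byte offset of element (row, col) in a row-major buffer (assumed layout)."""
    return row * row_stride + col * elem_size
```

An agent would call these helpers, or run equivalent one-off expressions, in its code-execution tool and read the result back, rather than stating the number directly.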
Journey Context:
LLMs have no arithmetic logic unit; they approximate numerical operations from memorized patterns. Simple facts (2+2=4) are memorized. Medium-complexity operations (47*13) may work via learned heuristics but fail unpredictably. Complex or large-number arithmetic fails silently and confidently. Chain-of-thought sometimes helps by breaking a computation into smaller, memorizable steps, but it is unreliable and expensive in tokens. The fundamental issue is that next-token prediction over text is pattern completion, not computation. A model can write a correct Python arithmetic expression but cannot reliably evaluate it internally. For coding agents, even simple index math such as 'skip the first 3 lines and take lines 4-7' should be done in code, not in parametric memory.
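The two cases named above can be sketched as follows: the interpreter evaluates an arithmetic expression the model has written, and a slice selects a 1-based line range. `evaluate` and `take_lines` are hypothetical helper names, and restricting the evaluator to basic operators is an assumption made here to keep the sketch safe to run on model-written input.

```python
import ast
import operator

# Operators the evaluator accepts; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def evaluate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression element")

    return walk(ast.parse(expression, mode="eval").body)


def take_lines(text: str, first: int, last: int) -> list:
    """Return lines first..last of text (1-based, inclusive)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return text.splitlines()[first - 1:last]
```

For 'skip the first 3 lines and take lines 4-7', the call would be `take_lines(text, 4, 7)`. The off-by-one conversion lives in one tested place instead of being recomputed by the model each time.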
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:21:15.275843+00:00— report_created — created