Report #99553
[counterintuitive] LLM gives wrong answers for arithmetic, number comparison, or counting items in text
Route every exact numeric computation (addition, multiplication, comparisons, counts, aggregations) through a calculator, Python REPL, or SQL engine; never trust the model itself to produce a precise number.
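As a rough illustration of the "calculator" route, the sketch below evaluates an arithmetic or comparison expression outside the model. The names `calc` and `_eval` are hypothetical, and the set of allowed operators is an assumption made for this example. It assumes the model is asked only to emit the expression as text, and that the application runs it through a whitelisted evaluator rather than `eval`.

```python
import ast
import operator
from fractions import Fraction

# Assumed whitelist for this sketch; extend it only with operators you need.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_CMP_OPS = {
    ast.Lt: operator.lt,
    ast.Gt: operator.gt,
    ast.Eq: operator.eq,
}


def _eval(node: ast.AST):
    """Recursively evaluate a whitelisted expression node."""
    if isinstance(node, ast.Constant):
        value = node.value
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("only numeric literals are allowed")
        # Fraction(str(x)) keeps decimal literals such as 0.1 exact.
        return Fraction(str(value))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if (
        isinstance(node, ast.Compare)
        and len(node.ops) == 1
        and type(node.ops[0]) in _CMP_OPS
    ):
        left = _eval(node.left)
        right = _eval(node.comparators[0])
        return _CMP_OPS[type(node.ops[0])](left, right)
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def calc(expression: str):
    """Evaluate +, -, *, / and a single <, >, == comparison exactly."""
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)
```

For example, `calc("9.11 < 9.9")` returns `True` and `calc("1 / 3")` returns `Fraction(1, 3)`. Anything outside the whitelist (names, function calls, attribute access) raises `ValueError`, and malformed input raises `SyntaxError` from `ast.parse`, so the caller should handle both.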
Journey Context:
Models are next-token predictors over subword tokens, not calculators. Numbers are split into arbitrary token chunks (e.g., "10000" may become "100" + "00"), and the autoregressive generation order (high-order digit first) runs opposite to how carries propagate in arithmetic (from the low-order digits upward). The NumericBench study shows that even frontier models fail simple number retrieval, comparison, and arithmetic tasks. A few lines of code are far cheaper and give exact results; asking the model to "show its work" does not remove the architectural mismatch.
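The digit-order point can be made concrete with schoolbook addition. The sketch below uses a hypothetical helper, `add_digit_strings`, written only for illustration. It adds two non-negative decimal strings from the lowest digit upward, which is the order in which carries are resolved. It says nothing about how any particular model computes internally; it only shows why the leading digit of a sum cannot be fixed before the low-order digits are known.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Schoolbook addition of two non-negative decimal digit strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    digits = []
    # Walk from the least significant digit to the most significant one.
    for da, db in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(da) + int(db) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


# Changing only the last digit of the first operand changes the first
# digit (and the length) of the result.
print(add_digit_strings("99998", "1"))  # 99999
print(add_digit_strings("99999", "1"))  # 100000
```

A generator that must commit to the leading digit first has to anticipate the whole carry chain in advance, which is the mismatch described above.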
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:20:14.211590+00:00 — report_created — created