Report #100454
[counterintuitive] Do LLMs that solve advanced math problems truly understand arithmetic?
Do not assume math-benchmark success implies robust symbolic reasoning. For numerical code, validate with property-based tests (commutativity, invariants, random values) and avoid relying on LLM arithmetic in critical paths.
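The property checks named above can be sketched with the standard library alone. This is a minimal illustration, not a prescribed harness: `check_addition_properties` and `flaky_add` are hypothetical names, and `add` stands for whatever routine is under test (for example, a wrapper around model-produced arithmetic).

```python
import random


def check_addition_properties(add, trials=1000, seed=0, lo=-10**12, hi=10**12):
    """Run randomized property checks against `add`.

    Returns a list of failure tuples; an empty list means every check passed.
    `add` is a hypothetical callable taking two ints and returning an int.
    """
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    failures = []
    for _ in range(trials):
        a = rng.randint(lo, hi)
        b = rng.randint(lo, hi)
        c = rng.randint(lo, hi)
        # Commutativity: a + b == b + a
        if add(a, b) != add(b, a):
            failures.append(("commutativity", a, b))
        # Identity invariant: a + 0 == a
        if add(a, 0) != a:
            failures.append(("identity", a))
        # Associativity: (a + b) + c == a + (b + c)
        if add(add(a, b), c) != add(a, add(b, c)):
            failures.append(("associativity", a, b, c))
        # Agreement with a trusted reference implementation
        if add(a, b) != a + b:
            failures.append(("reference", a, b))
    return failures


def flaky_add(a, b):
    """Deliberately broken stand-in: off by one whenever a > b."""
    return a + b + (1 if a > b else 0)


if __name__ == "__main__":
    print(len(check_addition_properties(lambda a, b: a + b)))  # correct adder
    print(check_addition_properties(flaky_add)[:3])            # sample failures
```

A dedicated property-testing library would add input shrinking and better reporting. The plain `random` version is used here only to keep the sketch dependency-free.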
Journey Context:
It is tempting to think that a model scoring well on GSM8K or graduate-level math exams has mastered the basic rules of arithmetic. A rule-focused diagnostic on two-integer addition found the opposite: models with over 99% accuracy on ordinary numerals collapsed to roughly 7.5% when the digits were replaced with novel symbols, and some models violated commutativity on up to 20% of pairs. Success on complex benchmarks can therefore mask a dependence on surface patterns. Interventions such as supplying the rules explicitly often hurt performance, which suggests that the models are not executing a stable algorithm.
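A probe in the spirit of the diagnostic described above might look like the sketch below. It is not the original study's code or protocol. The symbol alphabet, the function names, and the `solver` callable (a stand-in for a model call that takes two encoded strings and returns an encoded sum) are all assumptions made for illustration, and only non-negative integers are handled.

```python
import random

DIGITS = "0123456789"


def make_symbol_map(symbols, seed=0):
    """Map each decimal digit to one of ten distinct replacement symbols."""
    if len(symbols) != 10 or len(set(symbols)) != 10:
        raise ValueError("need exactly ten distinct symbols")
    shuffled = list(symbols)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(DIGITS, shuffled))


def encode(n, mapping):
    """Rewrite a non-negative integer using the replacement symbols."""
    return "".join(mapping[d] for d in str(n))


def decode(s, mapping):
    """Inverse of encode: turn a symbol string back into an integer."""
    inverse = {sym: d for d, sym in mapping.items()}
    return int("".join(inverse[ch] for ch in s))


def accuracy(solver, pairs, mapping):
    """Fraction of pairs where the solver returns the correctly encoded sum."""
    correct = sum(
        solver(encode(a, mapping), encode(b, mapping)) == encode(a + b, mapping)
        for a, b in pairs
    )
    return correct / len(pairs)


def commutativity_violation_rate(solver, pairs, mapping):
    """Fraction of pairs where swapping the operands changes the answer."""
    violations = sum(
        solver(encode(a, mapping), encode(b, mapping))
        != solver(encode(b, mapping), encode(a, mapping))
        for a, b in pairs
    )
    return violations / len(pairs)


if __name__ == "__main__":
    mapping = make_symbol_map("!@#$%^&*()")
    rng = random.Random(1)
    pairs = [(rng.randint(0, 9999), rng.randint(0, 9999)) for _ in range(200)]

    # Oracle solver: decodes, adds, re-encodes. A real run would call a model here.
    def oracle(x, y):
        return encode(decode(x, mapping) + decode(y, mapping), mapping)

    print(accuracy(oracle, pairs, mapping))
    print(commutativity_violation_rate(oracle, pairs, mapping))
```

Running the same pairs once with ordinary digits and once with the symbol mapping gives the two accuracy figures to compare. The violation rate needs no ground truth, which makes it a cheap first check.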
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:15:19.878638+00:00— report_created — created