Report #53447
[counterintuitive] A sufficiently capable model with good prompting can do precise arithmetic and math
Route all precise arithmetic, numerical computation, and comparison operations to a code interpreter or calculator. Use the model for mathematical reasoning (which approach to take, which formula to apply) but never for the actual computation.
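One way this routing might look in practice is sketched below: the model is asked only to produce an arithmetic expression as text, and a small evaluator computes the result exactly. The names `evaluate_exact` and `_eval_node` are hypothetical helpers invented for this illustration, and the evaluator deliberately supports only +, -, * and / on numeric literals. It is a minimal sketch under those assumptions, not a full calculator tool.

```python
import ast
import operator
from fractions import Fraction

# Hypothetical helper names; only + - * / on numeric literals are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> Fraction:
    """Recursively evaluate a parsed expression using exact rationals."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("only numeric literals are allowed")
        if isinstance(value, int):
            return Fraction(value)
        # Short decimal literals such as 0.1 round-trip through str() as "0.1",
        # so they become the exact rational 1/10 rather than a binary float.
        return Fraction(str(value))
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def evaluate_exact(expression: str) -> Fraction:
    """Evaluate a model-written arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


if __name__ == "__main__":
    # The model chooses the formula; the code does the computation.
    print(evaluate_exact("123456789123456789 * 987654321987654321"))
    # Comparisons are also done in code, not by the model.
    print(evaluate_exact("9.11") < evaluate_exact("9.9"))
```

Parsing with `ast` instead of calling `eval` is a design choice for this sketch: anything other than plain arithmetic (names, function calls, attribute access) raises `ValueError` rather than being executed.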
Journey Context:
The belief is that math errors are just reasoning failures that bigger models or better prompts will overcome. But autoregressive generation has a structural mismatch with precise computation. In multi-digit addition, the correct value of each digit depends on carries that propagate from right to left, but the model generates digits left to right. Each digit is a probabilistic prediction conditioned on all previous tokens. Even at 99.5% per-digit accuracy, and treating digits as independent, a 20-digit result has roughly a 10% chance of containing at least one error (0.995^20 ≈ 0.905), and that chance grows with every additional digit or step. This compounding error is inherent to the autoregressive architecture: scaling model size improves per-token accuracy but never reaches the 100% required for reliable multi-step computation. The model is a pattern completer, not a calculator.
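The compounding argument can be checked with a few lines of code. The sketch below assumes each digit is an independent prediction with the same accuracy p, so the chance that all n digits are right is p^n and the chance of at least one error is 1 - p^n. The function name `error_probability` is made up for this illustration, and real model errors are not truly independent, so treat the numbers as a rough guide.

```python
def error_probability(per_digit_accuracy: float, digits: int) -> float:
    """Chance of at least one wrong digit, assuming independent digits."""
    if not 0.0 <= per_digit_accuracy <= 1.0:
        raise ValueError("per_digit_accuracy must be between 0 and 1")
    if digits < 0:
        raise ValueError("digits must be non-negative")
    # P(all digits correct) = p ** n, so P(any error) = 1 - p ** n.
    return 1.0 - per_digit_accuracy ** digits


if __name__ == "__main__":
    # Show how the error rate grows with the length of the computation.
    for n in (5, 20, 100, 500):
        print(f"{n:>4} digits: {error_probability(0.995, n):.1%} chance of an error")
```

Running it for increasing lengths shows why a higher per-digit accuracy only delays the problem: for any p below 1, the error probability still approaches 100% as n grows.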
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:12:31.117401+00:00— report_created — created