Report #47182
[counterintuitive] A bigger or next-generation model will fix character counting, exact arithmetic, and string reversal
Classify task failures into capability gaps (improved by scale) and representation gaps (not improved by scale). Tokenization-dependent failures are representation gaps: use tools now rather than waiting for scale to solve them.
Journey Context:
When GPT-4 fails to count the letters in 'strawberry', the common reaction is 'GPT-5 will fix this'. But scaling laws describe how next-token prediction loss improves as a function of parameters and data; they do not change the input representation. A 100x larger model still uses BPE tokens, still predicts next tokens autoregressively, and still computes attention over tokens. Character counting fails because tokenization hides character-level information before the model sees it: the model receives token IDs, not letters. String reversal fails for the same tokenization reason. Exact arithmetic fails because next-token prediction emits likely digits rather than executing an algorithm. These failures are not on the scaling curve. The mental model: scaling improves what the architecture can already do (pattern recognition, reasoning within token space); it does not grant the architecture new input modalities or computational primitives.
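The "use tools now" advice can be sketched as a few deterministic helpers that a model would call instead of answering from token space. This is a minimal illustration, not part of the original report: the function names are hypothetical, and the token split shown for 'strawberry' is an assumed example, since real BPE vocabularies split words differently.

```python
from fractions import Fraction

# Assumed token split, for illustration only. A real BPE tokenizer may
# split this word differently; the point is that the model receives
# opaque pieces like these rather than individual letters.
TOY_TOKENS = ["str", "aw", "berry"]


def count_char(text: str, ch: str) -> int:
    """Count occurrences of one character, ignoring case."""
    if len(ch) != 1:
        raise ValueError("ch must be a single character")
    return text.lower().count(ch.lower())


def reverse_text(text: str) -> str:
    """Reverse a string by code point.

    Note: text with combining marks or multi-code-point emoji would need
    grapheme-aware handling, which this sketch does not attempt.
    """
    return text[::-1]


def exact_product(a: int, b: int) -> int:
    """Multiply two integers exactly (Python ints are arbitrary precision)."""
    return a * b


def exact_decimal_sum(a: str, b: str) -> Fraction:
    """Add two decimal strings exactly, avoiding binary float rounding."""
    return Fraction(a) + Fraction(b)


if __name__ == "__main__":
    word = "".join(TOY_TOKENS)
    print(word, "has", count_char(word, "r"), "occurrences of 'r'")
    print("reversed:", reverse_text(word))
    print("0.1 + 0.2 =", exact_decimal_sum("0.1", "0.2"))
```

Each helper operates on characters or integers directly, so its result does not depend on how the text happens to be tokenized or on model size.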
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:40:09.842610+00:00— report_created — created