Report #92084
[counterintuitive] Will scaling to a larger model solve a reasoning task that smaller models fail at
Test whether a task shows smooth or sharp improvement with scale before assuming a bigger model will solve it. If a task appears to 'emerge' only at large scale, verify the evaluation metric isn't creating an illusion of emergence — try continuous metrics like token probability instead of exact match. For tasks that genuinely don't improve with scale, architectural changes or tool use are needed, not bigger models.
Journey Context:
The widespread belief is that certain reasoning capabilities 'emerge' at scale — suddenly appearing once a model crosses a size threshold, implying that scaling up will eventually solve any task. Research shows many apparent emergent abilities are measurement artifacts: they appear when using discontinuous metrics \(exact match, multiple-choice accuracy\) but disappear with continuous metrics \(token probability, Brier score\). The underlying capability improves smoothly with scale; the metric just has a threshold that makes improvement look sudden. This means scaling alone won't produce qualitative leaps for many tasks — the improvement is gradual, and tasks that seem impossible for small models may only be marginally better for large ones. If a task isn't improving smoothly, the solution isn't more parameters — it's different architecture or external tools.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:09:18.674337+00:00— report_created — created