Report #97562
[counterintuitive] Scaling model size always improves performance on reasoning tasks
Before upgrading, test the larger model on tasks that require overriding a tempting-but-wrong answer; on such tasks, smaller models or specialized fine-tunes sometimes outperform general-purpose large models.
Journey Context:
The default assumption is that bigger is better. The Inverse Scaling Prize collected tasks on which larger language models perform worse than smaller ones, often because larger models pick up misleading patterns from their training data more strongly. This matters for code: a larger model may propose the more 'obvious' but wrong fix for a subtle bug. Do not blindly upgrade model size; validate each candidate model on adversarial examples from your own domain first.
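The validation step described above can be sketched as a small comparison harness. This is a minimal illustration, not a real benchmark: the names `accuracy`, `compare_models`, and `inverse_scaling_detected` are hypothetical, and the two "models" are hard-coded stubs standing in for real model calls. The stub answers are made up to show the mechanics and say nothing about how any actual model behaves.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any function from prompt text to answer text (assumption:
# in practice you would wrap your own inference call behind this signature).
Model = Callable[[str], str]
Example = Tuple[str, str]  # (prompt, expected answer)

# Adversarial examples: each has a tempting-but-wrong answer.
EXAMPLES: List[Example] = [
    ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
     "the ball. How many cents does the ball cost?", "5"),
    ("If 5 machines take 5 minutes to make 5 widgets, how many minutes do "
     "100 machines take to make 100 widgets?", "5"),
]

# The tempting wrong answer for each prompt, used only by the stub below.
TEMPTING = {EXAMPLES[0][0]: "10", EXAMPLES[1][0]: "100"}
CORRECT = dict(EXAMPLES)


def small_stub(prompt: str) -> str:
    """Hypothetical stand-in for a smaller model; hard-coded to answer correctly."""
    return CORRECT[prompt]


def large_stub(prompt: str) -> str:
    """Hypothetical stand-in for a larger model; hard-coded to take the bait."""
    return TEMPTING[prompt]


def accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples where the model's stripped output matches exactly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for prompt, expected in examples
                  if model(prompt).strip() == expected)
    return correct / len(examples)


def compare_models(models: Dict[str, Model],
                   examples: List[Example]) -> Dict[str, float]:
    """Score every candidate model on the same adversarial set."""
    return {name: accuracy(model, examples) for name, model in models.items()}


def inverse_scaling_detected(scores: Dict[str, float],
                             smaller: str, larger: str) -> bool:
    """True when the larger model scores strictly worse than the smaller one."""
    return scores[larger] < scores[smaller]


if __name__ == "__main__":
    scores = compare_models(
        {"small-stub": small_stub, "large-stub": large_stub}, EXAMPLES
    )
    print(scores)
    print(inverse_scaling_detected(scores, "small-stub", "large-stub"))
```

To use this on a real decision, replace the stubs with calls to the models under consideration and replace `EXAMPLES` with adversarial cases drawn from your own domain. Exact string matching is the simplest possible scorer; free-form outputs such as code fixes would likely need a task-specific check instead.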
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:20:00.353941+00:00— report_created — created