Report #99067
[counterintuitive] Larger model confidently gives wrong answers where a smaller model says it is unsure
Measure calibration and per-task accuracy rather than assuming scale fixes everything. For tasks requiring calibrated uncertainty, smaller or specifically tuned models may outperform larger ones.
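One way to put a number on "calibration" is expected calibration error (ECE) with equal-width confidence bins. The sketch below is a minimal illustration, not a method taken from this report: the function name, the bin count, and the assumption that each answer comes with a confidence in [0, 1] and a correct/incorrect label are all hypothetical choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans, whether each answer was actually right.
    Returns the example-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one example")

    # Assign each example to a bin; confidence 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece
```

A lower value means stated confidence tracks observed accuracy more closely. Computing it per task for each candidate model gives a like-for-like comparison that raw accuracy alone does not.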
Journey Context:
Scaling laws are often taken to mean that bigger is always better, but the Inverse Scaling Prize collected tasks on which larger models perform worse, often because they overfit to distributional priors or become more confident in wrong answers. Benchmark on your actual task and uncertainty requirements instead of defaulting to the largest model.
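A per-task benchmark that treats "unsure" as a distinct outcome can be sketched as follows. This assumes each model response has already been labelled "correct", "wrong", or "abstain"; the function name, the metric names, the penalty weight, and the toy records are all hypothetical and made up for illustration, not measurements from any real model.

```python
from collections import defaultdict

OUTCOMES = ("correct", "wrong", "abstain")

def per_task_report(records, wrong_penalty=1.0):
    """Summarise (task, outcome) pairs per task.

    accuracy:            correct / all examples
    coverage:            answered (not abstained) / all examples
    selective_accuracy:  correct / answered, or None if nothing was answered
    penalized_score:     (correct - wrong_penalty * wrong) / all examples
    """
    counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))
    for task, outcome in records:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts[task][outcome] += 1

    report = {}
    for task, c in counts.items():
        total = sum(c.values())
        answered = c["correct"] + c["wrong"]
        report[task] = {
            "accuracy": c["correct"] / total,
            "coverage": answered / total,
            "selective_accuracy": c["correct"] / answered if answered else None,
            "penalized_score": (c["correct"] - wrong_penalty * c["wrong"]) / total,
        }
    return report

# Made-up toy records for one task: the "large" model always answers,
# the "small" model abstains when unsure.
large_records = [("qa", "correct")] * 6 + [("qa", "wrong")] * 4
small_records = ([("qa", "correct")] * 5 + [("qa", "wrong")] * 1
                 + [("qa", "abstain")] * 4)

large_report = per_task_report(large_records)
small_report = per_task_report(small_records)
```

In this toy setup the always-answering model has higher raw accuracy, while the abstaining model has the higher penalized score once wrong answers carry a cost. Which metric matters, and how heavily to weight `wrong_penalty`, depends on what a confident wrong answer costs in your application.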
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:15:19.175234+00:00— report_created — created