Report #22700
[counterintuitive] Larger models are inherently more aligned and produce safer outputs
Do not assume model size correlates with safety. Implement explicit safety checks regardless of model size. Test for inverse scaling patterns, where larger models perform worse on your specific task. For high-stakes, tightly constrained tasks, prefer smaller, auditable models rather than defaulting to the largest available model.
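One way to read "explicit safety checks regardless of model size" is to wrap every model call in the same output checks, whichever model sits behind it. The following is a minimal sketch under that assumption. The names guarded_generate, no_blocked_terms and max_length are hypothetical and do not come from any particular library, and the two checks are placeholders for whatever policy your task actually needs.

```python
from typing import Callable, Iterable, Sequence

# Hypothetical check type: returns True when an output is acceptable.
SafetyCheck = Callable[[str], bool]


def no_blocked_terms(blocked: Iterable[str]) -> SafetyCheck:
    """Build a check that rejects outputs containing any blocked term."""
    terms = [term.lower() for term in blocked]

    def check(output: str) -> bool:
        lowered = output.lower()
        return not any(term in lowered for term in terms)

    return check


def max_length(limit: int) -> SafetyCheck:
    """Build a check that rejects outputs longer than `limit` characters."""
    return lambda output: len(output) <= limit


def guarded_generate(
    generate: Callable[[str], str],
    prompt: str,
    checks: Sequence[SafetyCheck],
    fallback: str = "[withheld: failed safety check]",
) -> str:
    """Call any model and apply the same checks, independent of its size."""
    output = generate(prompt)
    if all(check(output) for check in checks):
        return output
    return fallback


if __name__ == "__main__":
    # Stand-in for a real model call; swap in your own client function.
    def fake_model(prompt: str) -> str:
        return "echo: " + prompt

    checks = [no_blocked_terms(["rm -rf"]), max_length(200)]
    print(guarded_generate(fake_model, "list files", checks))
    print(guarded_generate(fake_model, "run rm -rf /", checks))
```

Because the wrapper takes the model as a plain callable, the same checks run unchanged when the underlying model is swapped for a larger or smaller one.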
Journey Context:
The Inverse Scaling Prize (McKenzie et al.) demonstrated specific tasks where model performance gets worse as size increases, including following instructions with misleading context and resisting bias in certain configurations. Larger models are more capable, which also means they are more capable of producing sophisticated harmful outputs, more susceptible to certain prompt injection patterns, and better at rationalizing incorrect answers with fluent reasoning. The intuition that 'bigger = more trained = safer' fails because alignment and capability are different axes. A more capable model that is misaligned is more dangerous than a less capable one, because it can pursue wrong objectives more effectively and persuasively.
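Checking your own task for this pattern can be as simple as scoring the same evaluation set with several model sizes and looking for drops. The sketch below assumes you already have one accuracy figure per model size. The function names and the example numbers are illustrative only and are not results from the Inverse Scaling Prize.

```python
from typing import Dict, List, Tuple


def regressions(
    scores: Dict[float, float], tolerance: float = 0.0
) -> List[Tuple[float, float]]:
    """Return (smaller_size, larger_size) pairs where the larger model scored worse.

    `scores` maps model size (for example, billions of parameters) to task
    accuracy. A drop counts only if it exceeds `tolerance`, which helps
    ignore evaluation noise.
    """
    sizes = sorted(scores)
    return [
        (small, large)
        for small, large in zip(sizes, sizes[1:])
        if scores[large] < scores[small] - tolerance
    ]


def shows_inverse_scaling(scores: Dict[float, float], tolerance: float = 0.0) -> bool:
    """True if the largest model scores below the smallest by more than `tolerance`."""
    if len(scores) < 2:
        return False
    sizes = sorted(scores)
    return scores[sizes[-1]] < scores[sizes[0]] - tolerance


if __name__ == "__main__":
    # Made-up accuracies for illustration; replace with your own measurements.
    task_scores = {1.0: 0.80, 7.0: 0.70, 70.0: 0.60}
    print(regressions(task_scores, tolerance=0.02))
    print(shows_inverse_scaling(task_scores, tolerance=0.02))
```

Adjacent-pair comparison also surfaces non-monotonic curves, where a mid-sized model dips and a larger one recovers, which an endpoint-only comparison would miss.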
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:30:55.343697+00:00— report_created — created