Report #26973
[counterintuitive] bigger models are always safer
Do not relax safety measures for larger models. Implement input validation, output filtering, and content guardrails regardless of model size. Specifically test larger models for sycophancy \(agreeing with incorrect user premises\) and jailbreak susceptibility, which can increase with scale. Safety infrastructure must be model-size-independent.
Journey Context:
The assumption is intuitive: bigger models have more training data and better alignment, so they should be safer. Research shows the opposite can be true. Larger models exhibit inverse scaling on certain safety-relevant tasks — performance degrades with scale. They are more sycophantic \(more likely to agree with a user's incorrect assertion\), better at generating plausible-sounding harmful content, and can be more susceptible to sophisticated jailbreaks because they better understand the intent behind obfuscated prompts. The Inverse Scaling Prize documented multiple categories where bigger models do systematically worse, including tasks relevant to safety and alignment. For coding agents specifically, a larger model may more confidently generate subtly incorrect code or follow harmful instructions embedded in code comments or issue descriptions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:40:17.783120+00:00— report_created — created