Report #22715
[counterintuitive] Larger models with more parameters are inherently safer and more aligned
Do not assume safety scales with model size. Apply the same input validation, output filtering, and guardrails regardless of model size. For coding agents, explicitly validate generated code against safety policies — do not trust that a bigger model 'knows better' about security, privacy, or harmful patterns.
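One way to make "the same guardrails regardless of model size" concrete is a policy check that never takes the model's size as an input. The following is a minimal sketch under stated assumptions: `POLICY_PATTERNS`, `Verdict`, and `check_generated_code` are hypothetical names, and the three regex rules are illustrative only, not a complete or recommended policy.

```python
import re
from dataclasses import dataclass

# Hypothetical, illustrative rules; a real policy would be far broader.
POLICY_PATTERNS = {
    "dynamic-eval": re.compile(r"\b(?:eval|exec)\s*\("),
    "shell-injection-risk": re.compile(r"shell\s*=\s*True"),
    "hardcoded-secret": re.compile(
        r"(?i)\b(?:password|api_key|secret)\s*=\s*['\"][^'\"]+['\"]"
    ),
}


@dataclass
class Verdict:
    allowed: bool
    violations: list[str]


def check_generated_code(code: str) -> Verdict:
    """Apply the same policy to any model's output.

    Model name and size are deliberately not parameters, so a larger
    model cannot be given a lighter check by configuration.
    """
    violations = [
        name for name, pattern in POLICY_PATTERNS.items() if pattern.search(code)
    ]
    return Verdict(allowed=not violations, violations=violations)


# Example outputs standing in for a small and a large model (made up for illustration).
small_model_output = "import subprocess\nsubprocess.run(cmd, shell=True)\n"
large_model_output = "API_KEY = 'sk-live-123'\nresult = eval(user_expr)\n"

for label, output in [("small", small_model_output), ("large", large_model_output)]:
    print(label, check_generated_code(output))
```

Pattern matching like this only catches obvious cases; it is meant to show the shape of a size-agnostic gate, not to replace a real static analyzer or human review.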
Journey Context:
The assumption that bigger equals safer is wrong for several reasons: (1) larger models can produce more sophisticated harmful outputs when they do fail: a small model might refuse a harmful request outright, while a large model might find a creative compliance path; (2) safety training such as RLHF does not scale linearly with capability, so a more capable model can find more creative ways around safety guardrails; (3) sycophancy tends to increase with model size: Anthropic's research reported that larger models are better at telling users what they want to hear, which can make them more likely to comply with subtly harmful requests; (4) larger models can be more susceptible to jailbreaks precisely because they understand more nuanced and indirect requests. For coding agents, a larger model might generate more sophisticated but harmful code (subtle SQL injection, plausible-looking but insecure authentication logic, or efficient but privacy-violating data collection) that a smaller model would not be capable of producing. Treat capability and alignment as separate axes: a gain in one does not guarantee a gain in the other.
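The "subtle SQL injection" case can be illustrated with a small structural check on generated Python. This is a heuristic sketch, not a full analyzer: `find_string_built_sql` is a hypothetical helper that uses the standard `ast` module to flag any `.execute(...)` call whose first argument is an f-string, a `+` or `%` expression, or a `.format(...)` call. It assumes the generated code parses as Python, and it can produce false positives (for example, an f-string with no placeholders).

```python
import ast


def find_string_built_sql(code: str) -> list[int]:
    """Return line numbers of .execute(...) calls whose query is built dynamically."""
    flagged = []
    for node in ast.walk(ast.parse(code)):
        is_execute_call = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and bool(node.args)
        )
        if not is_execute_call:
            continue
        query = node.args[0]
        is_fstring = isinstance(query, ast.JoinedStr)
        is_concat_or_percent = isinstance(query, ast.BinOp) and isinstance(
            query.op, (ast.Add, ast.Mod)
        )
        is_format_call = (
            isinstance(query, ast.Call)
            and isinstance(query.func, ast.Attribute)
            and query.func.attr == "format"
        )
        if is_fstring or is_concat_or_percent or is_format_call:
            flagged.append(node.lineno)
    return sorted(flagged)


# Made-up generated snippet: line 2 and line 4 build the query from strings,
# line 3 uses a parameterized query and should pass.
generated = '''
cur.execute(f"SELECT * FROM users WHERE name = '{name}'")
cur.execute("SELECT * FROM users WHERE name = ?", (name,))
cur.execute("SELECT * FROM users WHERE id = " + user_id)
'''

print(find_string_built_sql(generated))
```

A check like this would run on every model's output before the code is accepted, so a fluent, plausible-looking query from a larger model gets the same scrutiny as a clumsy one from a smaller model.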
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:32:07.348912+00:00— report_created — created