Report #25346
[counterintuitive] Larger models are inherently safer and more aligned than smaller models
Do not assume safety from scale. Implement explicit safety guardrails (input/output filtering, tool-use constraints, permission boundaries, human-in-the-loop for destructive actions) regardless of model size. Test adversarial inputs against your specific deployment: larger models can be more susceptible to sophisticated jailbreaks due to their greater capability to follow complex adversarial instructions.
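As a minimal sketch of the tool-use constraints and human-in-the-loop step described above, the following gate checks a model-proposed shell command before anything runs it. The allowlist contents, the `Decision` type, and the `approve` callback are all hypothetical names chosen for illustration and do not come from any particular agent framework.

```python
import shlex
from dataclasses import dataclass
from typing import Callable

# Hypothetical policy for illustration; tune these sets to your own deployment.
ALLOWED_COMMANDS = {"ls", "cat", "git", "rm"}
DESTRUCTIVE_COMMANDS = {"rm"}
DESTRUCTIVE_GIT_SUBCOMMANDS = {"push", "reset", "clean"}
SHELL_METACHARACTERS = ";|&`$><\n"


@dataclass
class Decision:
    allowed: bool
    reason: str


def validate_command(command: str, approve: Callable[[str], bool]) -> Decision:
    """Decide whether a model-proposed command may run.

    `approve` stands in for a human reviewer: it receives the command
    and returns True only if the reviewer accepts it.
    """
    try:
        argv = shlex.split(command)
    except ValueError:
        return Decision(False, "unparseable command")
    if not argv:
        return Decision(False, "empty command")
    # Conservative: reject any metacharacter, even inside quotes, so an
    # allowed command cannot be chained to a second, unchecked one.
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return Decision(False, "shell metacharacters are not permitted")
    program = argv[0]
    if program not in ALLOWED_COMMANDS:
        return Decision(False, f"{program!r} is not on the allowlist")
    destructive = program in DESTRUCTIVE_COMMANDS or (
        program == "git"
        and len(argv) > 1
        and argv[1] in DESTRUCTIVE_GIT_SUBCOMMANDS
    )
    # Destructive actions always need explicit human approval.
    if destructive and not approve(command):
        return Decision(False, "destructive action denied by human reviewer")
    return Decision(True, "ok")
```

The check is identical whichever model proposed the command, which is the point: the gate does not become optional when the model gets larger. Rejecting quoted metacharacters will block some legitimate commands; that trade-off is an assumption of this sketch, not a recommendation for every deployment.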
Journey Context:
The intuition that bigger equals safer comes from the way RLHF appears to work better at scale, but it breaks down in practice. Larger models are more capable, which means they can follow safety instructions better and also follow adversarial instructions better. The net effect depends on the specific attack vector. Anthropic's Responsible Scaling Policy explicitly acknowledges that more capable models may pose greater risks and require additional safety measures, not fewer. Larger models may be more capable of deceptive alignment, can be more susceptible to sophisticated prompt injection, and can cause more harm when they do fail. For coding agents, this means you cannot skip permission systems, command validation, and output filtering just because you are using a frontier model. Safety is a system property, not a model property.
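To illustrate the output-filtering layer mentioned above, here is a small sketch that redacts secret-looking strings from model or tool output before it is shown or logged. The function name `filter_output` and the pattern list are assumptions made for this example; the patterns are deliberately few and would need to be extended and tested for a real deployment.

```python
import re
from typing import Tuple

# Illustrative patterns only; not an exhaustive or vetted secret scanner.
SECRET_PATTERNS = [
    # Shape of an AWS access key ID.
    re.compile(r"AKIA[0-9A-Z]{16}"),
    # PEM-encoded private key blocks.
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    # Simple "name = value" style credentials.
    re.compile(r"(?i)(password|api[_-]?key|token)\s*[:=]\s*\S+"),
]


def filter_output(text: str) -> Tuple[str, int]:
    """Return the text with matches redacted, plus the number of redactions."""
    redactions = 0
    for pattern in SECRET_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        redactions += count
    return text, redactions
```

A nonzero redaction count is a useful signal in its own right: it can be logged or used to pause the agent for review, independent of how capable the underlying model is.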
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:56:48.089565+00:00— report_created — created