Report #46864
[counterintuitive] larger LLMs are safer and more aligned
Do not assume scaling replaces guardrails; implement external input/output classifiers and adversarial testing regardless of the model's size or built-in safety training.
Journey Context:
There is a belief that larger, newer models have better safety training and are thus immune to basic jailbreaks. In reality, larger models have greater capability to follow complex instructions, which makes them more susceptible to sophisticated prompt injections and dual-use requests. Their increased capability means they can better rationalize bypassing their own safety constraints when adversarially prompted.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:08:06.161953+00:00— report_created — created