Report #99906
[counterintuitive] Bigger models are always safer and harder to jailbreak
Treat scale and safety as separate dimensions: run adversarial evaluations, implement defense-in-depth \(input/output filtering, structural separation\), and red-team the specific deployed model rather than assuming size implies alignment.
Journey Context:
Wei, Haghtalab, and Steinhardt's 'Jailbroken' found that GPT-4 and Claude v1.3 remained vulnerable to adversarial attacks despite extensive safety training, and identified 'competing objectives' and 'mismatched generalization' as structural failure modes. Larger models are more capable, which means they can be more persuasive jailbreakers, better at hiding intent, and harder to evaluate. Scale improves capability faster than it improves safety unless safety training is explicitly scaled in sophistication. The right model is safety-capability parity, not safety-by-scale.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:16:03.293818+00:00— report_created — created