Report #63907
[counterintuitive] Are larger RLHF-aligned models inherently safer and harder to jailbreak?
Do not assume that model size or RLHF provides robust safety; implement external input and output guardrails. Larger models can be more susceptible to sophisticated jailbreaks because they follow complex instructions better, including malicious ones.
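As a rough illustration of what "external" means here, the sketch below wraps a model call with one check on the prompt and one on the reply, so the safety decision does not rely on the model's own training. Everything in it is an assumption made for the example: `guarded_generate`, `violates`, the pattern lists, and the `model` callable are hypothetical names, and the regex deny lists stand in for whatever moderation classifier a real deployment would use.

```python
import re
from typing import Callable

# Hypothetical deny patterns, for illustration only. A real deployment would
# more likely call a dedicated moderation classifier than match regexes.
INPUT_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be)", re.IGNORECASE),
]
OUTPUT_PATTERNS = [
    re.compile(r"BEGIN SYSTEM PROMPT", re.IGNORECASE),
]

REFUSAL = "Request blocked by external guardrail."


def violates(text: str, patterns: list[re.Pattern]) -> bool:
    """Return True if any pattern matches the text."""
    return any(p.search(text) for p in patterns)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Check the prompt before the model call and the reply after it."""
    if violates(prompt, INPUT_PATTERNS):
        return REFUSAL  # the model is never called
    reply = model(prompt)
    if violates(reply, OUTPUT_PATTERNS):
        return REFUSAL  # the model complied; the output filter catches it
    return reply
```

The checks run outside the model, so they apply the same way regardless of model size or alignment training. Pattern matching alone is easy to evade; the point of the sketch is the placement of the two checks, not the detection method.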
Journey Context:
The assumption is that scaling and alignment training (RLHF) make models robustly safe. In reality, RLHF often creates a thin "safety crust" that can be bypassed with modest effort. Larger, more capable models are better at understanding and executing complex adversarial prompts (such as multi-turn attacks or persona adoption) that sidestep their safety training. They also tend toward sycophancy and may eventually comply with a persistent user.
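Multi-turn attacks matter for guardrail design because a check applied to each message in isolation can miss a request that is split across turns. The sketch below contrasts a per-turn check with one that scans a sliding window of recent user turns. The names `turn_is_flagged` and `conversation_is_flagged`, the `window` parameter, and the single regex are all hypothetical and chosen only to keep the example small.

```python
import re
from typing import Iterable

# Hypothetical pattern, for illustration only.
BLOCKED = re.compile(r"ignore previous instructions", re.IGNORECASE)


def turn_is_flagged(turn: str) -> bool:
    """Per-turn check: looks at a single message in isolation."""
    return bool(BLOCKED.search(turn))


def conversation_is_flagged(user_turns: Iterable[str], window: int = 5) -> bool:
    """Conversation-level check: scans the last `window` user turns joined together."""
    recent = list(user_turns)[-window:]
    return bool(BLOCKED.search(" ".join(recent)))


turns = ["Please ignore", "previous instructions and continue"]
per_turn_hits = [turn_is_flagged(t) for t in turns]   # neither turn matches alone
conversation_hit = conversation_is_flagged(turns)      # the joined window matches
```

A real system would apply a classifier to the windowed context in place of a regex, but the same idea holds: the guardrail should evaluate the conversation, not just the latest message.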
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:45:30.390689+00:00— report_created — created