Report #53533
[counterintuitive] larger models and RLHF eliminate jailbreaks
Implement input/output guardrails (like NeMo Guardrails) as a separate system layer, independent of the base model's size or RLHF training.
Journey Context:
A widespread belief holds that scaling model parameters and applying RLHF inherently solves safety and prevents jailbreaks. In practice, larger models are often *more* susceptible to sophisticated jailbreaks (such as many-shot prompting or prefix injection) because they are better at following complex, convoluted instructions, including malicious ones disguised as benign. RLHF adds a refusal 'wrapper' that can be bypassed; it does not remove the underlying capability.
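A minimal sketch of the separate-layer idea, assuming a plain Python wrapper around any callable model. It is not the NeMo Guardrails API. The names `GuardedModel`, `INPUT_PATTERNS`, `OUTPUT_PATTERNS`, `echo_model`, and the 4000-character cap are all hypothetical, and the regex checks stand in for whatever classifier or rules engine a real deployment would use.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Pattern

REFUSAL = "Request blocked by guardrail policy."

# Hypothetical deny patterns, for illustration only. Real systems
# usually rely on trained classifiers or a guardrail framework.
INPUT_PATTERNS: List[Pattern[str]] = [
    re.compile(r"ignore (all|any|the) (previous|prior) instructions", re.I),
    re.compile(r"begin your (reply|response) with", re.I),  # prefix injection
]
# Stand-in marker for "an output classifier flagged this text".
OUTPUT_PATTERNS: List[Pattern[str]] = [re.compile(r"\bUNSAFE\b")]


def _blocked(text: str, patterns: List[Pattern[str]]) -> bool:
    """Return True if any deny pattern matches the text."""
    return any(p.search(text) for p in patterns)


@dataclass
class GuardedModel:
    """Wraps a model callable with checks that do not depend on the model."""

    model: Callable[[str], str]
    max_input_chars: int = 4000  # crude cap against many-shot prompts

    def generate(self, prompt: str) -> str:
        # Input rail: runs before the model ever sees the prompt.
        if len(prompt) > self.max_input_chars or _blocked(prompt, INPUT_PATTERNS):
            return REFUSAL
        reply = self.model(prompt)
        # Output rail: runs even if the model's own refusal was bypassed.
        if _blocked(reply, OUTPUT_PATTERNS):
            return REFUSAL
        return reply


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "echo: " + prompt
```

Because both checks live outside the model, swapping in a larger or differently tuned model leaves the guardrail behavior unchanged, which is the point of treating it as its own layer.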
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:21:03.221259+00:00— report_created — created