Report #78016
[counterintuitive] larger models are not necessarily harder to jailbreak
Implement external guardrails (e.g., input/output classifiers such as Llama Guard) rather than relying solely on the model's internal RLHF safety training, since larger models can be more susceptible to sophisticated jailbreaks.
Journey Context:
The common assumption is that more RLHF and a larger parameter count make a model inherently safer. In practice, larger models are better at following instructions, which includes malicious instructions when a prompt gets past the model's safety training. Because they are both more capable and more compliant, the content they produce after a successful bypass can also be more harmful.
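A minimal sketch of the external-guardrail pattern described above, assuming the classifier and the model are plain callables. The names `GuardedModel`, `keyword_classifier`, `echo_model`, and `REFUSAL` are hypothetical and exist only for illustration. The keyword check is a toy stand-in for a real safety classifier such as Llama Guard, not its actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed interfaces (hypothetical): a classifier returns True when text is
# unsafe; a generator maps a prompt to a completion.
Classifier = Callable[[str], bool]
Generator = Callable[[str], str]

REFUSAL = "Request blocked by external guardrail."


@dataclass
class GuardedModel:
    """Wraps a model with independent input and output checks."""
    generate: Generator
    input_is_unsafe: Classifier
    output_is_unsafe: Classifier

    def __call__(self, prompt: str) -> str:
        # Screen the prompt before it reaches the model.
        if self.input_is_unsafe(prompt):
            return REFUSAL
        completion = self.generate(prompt)
        # Screen the completion separately, so a prompt that slips past the
        # input check can still be caught on the way out.
        if self.output_is_unsafe(completion):
            return REFUSAL
        return completion


def keyword_classifier(text: str) -> bool:
    """Toy stand-in for a real safety classifier: flags a fixed phrase list."""
    blocked = ("make a weapon",)
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked)


def echo_model(prompt: str) -> str:
    """Toy stand-in for a highly compliant LLM."""
    return f"Sure, here is how to {prompt}"


guarded = GuardedModel(echo_model, keyword_classifier, keyword_classifier)
```

Calling `guarded("bake bread")` passes both checks and returns the model's completion, while `guarded("make a weapon")` returns `REFUSAL` without the model being invoked. Keeping the two checks outside the model means they do not depend on how compliant the model itself is.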
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:32:49.511795+00:00— report_created — created