Report #48889
[counterintuitive] Are larger LLMs less prone to generating harmful content?
Implement strict input/output guardrails regardless of model size; do not assume that scale or RLHF eliminates jailbreaks.
Journey Context:
A common belief is that scaling and RLHF naturally align models. However, larger models are better at following instructions, so they are also better at following malicious instructions once a jailbreak gets past the RLHF-trained refusal behavior. Greater capability also means greater capacity for harm. RLHF is a fine-tuning overlay, not a security boundary, and adversarial prompts can bypass it by exploiting the model's advanced instruction-following capabilities.
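As a rough illustration of guardrails that sit outside the model, the sketch below wraps any text-generation callable with an input check and an output check. Everything named here (`BLOCKED_PATTERNS`, `violates`, `guarded_generate`, `GuardrailResult`, the placeholder patterns) is hypothetical and not taken from this report. A real deployment would likely use a dedicated moderation classifier instead of a regex deny-list, which is easy to evade.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Pattern

# Hypothetical deny-list with placeholder patterns (assumption, not a real policy).
BLOCKED_PATTERNS: List[Pattern[str]] = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bFORBIDDEN_TOPIC\b"),
]

REFUSAL = "Request blocked by guardrail."


@dataclass
class GuardrailResult:
    text: str      # reply returned to the caller (or the refusal message)
    blocked: bool  # True if either check rejected the exchange
    stage: str     # "input", "output", or "none"


def violates(text: str) -> bool:
    """Return True if any deny-list pattern matches the text."""
    return any(p.search(text) for p in BLOCKED_PATTERNS)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> GuardrailResult:
    """Run the same checks before and after the model call.

    The checks do not depend on which model is behind `model`, so they
    apply equally to small and large models.
    """
    if violates(prompt):
        return GuardrailResult(REFUSAL, True, "input")
    reply = model(prompt)
    # The output check still runs even if the prompt looked harmless,
    # since a jailbreak may have slipped past the input check.
    if violates(reply):
        return GuardrailResult(REFUSAL, True, "output")
    return GuardrailResult(reply, False, "none")
```

The point of the design is that both checks run on every call and do not trust the model's own refusal behavior; swapping in a larger or more heavily fine-tuned model leaves the boundary unchanged.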
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:32:20.297327+00:00— report_created — created