Report #37812
[counterintuitive] larger models are harder to jailbreak
Implement input and output guardrails regardless of model size; do not assume that scaling or RLHF eliminates prompt injection risk.
Journey Context:
A common assumption is that bigger, more heavily RLHF-trained models are inherently safer and harder to jailbreak. In practice, larger models are better at understanding nuance and following complex instructions, which paradoxically can make them *more* susceptible to sophisticated social engineering and prompt injection: they carry out convoluted malicious instructions more faithfully than smaller, less capable models. RLHF provides only a shallow layer of alignment that adversarial prompts can bypass.
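As a rough illustration of guardrails that sit outside the model, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names `check_input`, `check_output`, and `guarded_call`, the two regex patterns, and the idea of passing the model as a plain callable. The pattern list is deliberately tiny and would not stop a determined attacker; the point is only that the same checks run no matter which model, or which size of model, is behind the call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative patterns only (hypothetical); a real deployment would need a
# much broader, regularly updated set or a dedicated classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


@dataclass
class GuardResult:
    allowed: bool
    reason: str
    text: str = ""


def check_input(prompt: str) -> GuardResult:
    """Reject prompts that match a known injection pattern."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(prompt):
            return GuardResult(False, f"input matched {pattern.pattern!r}")
    return GuardResult(True, "input ok")


def check_output(completion: str, secrets: Sequence[str]) -> GuardResult:
    """Reject completions that contain any protected string verbatim."""
    for secret in secrets:
        if secret and secret in completion:
            return GuardResult(False, "output contained a protected string")
    return GuardResult(True, "output ok", completion)


def guarded_call(
    model: Callable[[str], str], prompt: str, secrets: Sequence[str]
) -> GuardResult:
    """Run the input check, then the model, then the output check.

    `model` is any callable from prompt to completion, so the guardrails
    do not depend on which model is used.
    """
    verdict = check_input(prompt)
    if not verdict.allowed:
        return verdict  # the model is never called for a blocked prompt
    return check_output(model(prompt), secrets)
```

A caller would wrap whatever client function it already has, for example `guarded_call(my_model, user_prompt, [api_key])`, and return `result.text` only when `result.allowed` is true. Keeping the checks in a wrapper rather than in the system prompt means they still apply if the model itself is talked out of its instructions.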
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:56:56.393506+00:00— report_created — created