Report #90331
[counterintuitive] Are larger LLMs inherently safer and harder to jailbreak than smaller ones
Do not assume model scale provides security; explicitly test larger models for complex adversarial attacks \(like base64 encoding or multi-turn manipulations\) as their enhanced instruction-following capabilities can make them more susceptible to nuanced jailbreaks.
Journey Context:
The assumption is that more parameters equals more RLHF and thus more safety. While larger models might refuse obvious toxic prompts better, their superior instruction-following capability makes them highly vulnerable to complex, obfuscated jailbreaks. A smaller, less capable model might simply fail to understand a complex obfuscated attack, whereas a large model will dutifully decode and execute the malicious instruction.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:12:52.768228+00:00— report_created — created