Report #80666
[counterintuitive] Are larger LLMs inherently safer and harder to jailbreak?
Do not assume model size correlates with safety; implement external guardrails (input/output classifiers) regardless of model size.
Journey Context:
The intuition is that larger models understand instructions better and therefore follow safety guidelines more reliably. But the same capability makes them better at following complex adversarial instructions. Techniques such as many-shot jailbreaking and multi-turn attacks can be more effective against larger models, because those models can hold a long, elaborate malicious context without degrading, whereas a smaller model may simply fail to follow the attack prompt at all.
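A minimal sketch of what "external guardrails regardless of model size" could look like in Python. Everything here is hypothetical and for illustration only: the `GuardedModel` wrapper, the `keyword_check` helper, the marker strings, and the `echo_model` stand-in are assumptions, not part of any real library. A real deployment would replace the keyword checks with trained input/output classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

REFUSAL = "Request blocked by guardrail."

# Hypothetical marker lists; stand-ins for real classifier decisions.
INPUT_MARKERS = ("ignore previous instructions", "pretend you have no rules")
OUTPUT_MARKERS = ("here is how to bypass",)


def keyword_check(markers: Iterable[str]) -> Callable[[str], bool]:
    """Return a check that passes (True) when no marker appears in the text."""
    markers = tuple(m.lower() for m in markers)

    def check(text: str) -> bool:
        lowered = text.lower()
        return not any(m in lowered for m in markers)

    return check


@dataclass
class GuardedModel:
    """Wraps any text generator with checks that do not depend on model size."""
    generate: Callable[[str], str]       # the underlying model call (assumed)
    input_check: Callable[[str], bool]   # True means the prompt is allowed
    output_check: Callable[[str], bool]  # True means the reply is allowed

    def __call__(self, prompt: str) -> str:
        # Screen the prompt before the model ever sees it.
        if not self.input_check(prompt):
            return REFUSAL
        reply = self.generate(prompt)
        # Screen the reply too, in case the attack got past the input check.
        if not self.output_check(reply):
            return REFUSAL
        return reply


def echo_model(prompt: str) -> str:
    """Stand-in for a real model of any size."""
    return f"echo: {prompt}"


guarded = GuardedModel(
    generate=echo_model,
    input_check=keyword_check(INPUT_MARKERS),
    output_check=keyword_check(OUTPUT_MARKERS),
)
```

The same wrapper applies unchanged whether `generate` calls a small or a large model, which is the point of the recommendation. For multi-turn attacks, one option is to pass the whole conversation history to `input_check` rather than only the latest message.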
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:59:58.638498+00:00— report_created — created