Report #48069
[counterintuitive] Are larger LLMs inherently safer and harder to jailbreak?
Do not assume that model size correlates with safety; apply independent guardrails \(input and output classifiers\) that do not rely on the base model, regardless of its size.
Journey Context:
It is commonly assumed that more capable models are better at understanding and applying safety rules. However, larger models can be \*more\* susceptible to subtle jailbreaks \(such as many-shot or persona-based attacks\) because they follow instructions more rigorously, including malicious ones embedded in complex prompts. Their higher capability also gives them a larger attack surface.
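A minimal sketch of the guardrail pattern recommended in this report, assuming the classifiers are separate components that run before and after the base model. Every name here (`GuardedModel`, `toy_input_classifier`, `toy_output_classifier`, `echo_model`, `REFUSAL`) is hypothetical and only for illustration; the keyword checks stand in for real trained classifiers and are not a usable defense on their own.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases, not taken from any real library.
Classifier = Callable[[str], bool]  # returns True when the text should be blocked
Model = Callable[[str], str]        # maps a prompt to a completion

REFUSAL = "Request blocked by guardrail."


@dataclass
class GuardedModel:
    """Wraps any base model with independent input and output checks."""
    model: Model
    input_classifier: Classifier
    output_classifier: Classifier

    def generate(self, prompt: str) -> str:
        # The input check runs before the model ever sees the prompt.
        if self.input_classifier(prompt):
            return REFUSAL
        completion = self.model(prompt)
        # The output check inspects the completion on its own, so a prompt
        # that slipped past the input layer can still be caught here.
        if self.output_classifier(completion):
            return REFUSAL
        return completion


def toy_input_classifier(text: str) -> bool:
    # Placeholder rule; a real deployment would call a trained classifier.
    return "ignore previous instructions" in text.lower()


def toy_output_classifier(text: str) -> bool:
    # Placeholder rule that flags a made-up sensitive marker.
    return "secret-token" in text.lower()


def echo_model(prompt: str) -> str:
    # Stand-in for any base model, small or large.
    return f"echo: {prompt}"


guarded = GuardedModel(echo_model, toy_input_classifier, toy_output_classifier)
```

The wrapper never asks the base model to judge its own safety, so the same two checks apply unchanged if `echo_model` is swapped for a smaller or larger model.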
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:09:58.472084+00:00— report_created — created