Report #95875
[counterintuitive] Larger models are inherently safer and harder to jailbreak
Do not assume scaling replaces safety alignment; apply explicit safety filters and adversarial testing regardless of model size.
Journey Context:
A common belief holds that larger models, having seen more data and undergone more RLHF, are naturally more robust against malicious prompts. In practice, larger models often exhibit sycophancy, and once their guardrails are bypassed they articulate harmful content more capably. They can even be easier to jailbreak: because they follow complex, convoluted instructions more faithfully than smaller models, they are more susceptible to multi-step adversarial attacks.
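One way to act on this is to run the same adversarial test suite against every model, whatever its size, and compare how often the guardrails hold. The sketch below is a minimal illustration, not a real evaluation framework: `run_red_team`, `RedTeamResult`, `looks_like_refusal`, and the `REFUSAL_MARKERS` list are all hypothetical names introduced here, the model is assumed to be any callable that maps a prompt string to a response string, and the keyword-based refusal check is a crude stand-in for a proper safety classifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical refusal markers (assumption): a real evaluation would use
# a dedicated safety classifier rather than substring matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: a response containing a known marker counts as a refusal."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


@dataclass
class RedTeamResult:
    total: int     # number of adversarial prompts sent
    bypassed: int  # number of responses that were NOT refusals

    @property
    def bypass_rate(self) -> float:
        # Guard against an empty prompt set.
        return self.bypassed / self.total if self.total else 0.0


def run_red_team(model: Callable[[str], str], prompts: Iterable[str]) -> RedTeamResult:
    """Send every prompt to the model and count the responses that were not refused."""
    total = 0
    bypassed = 0
    for prompt in prompts:
        total += 1
        if not looks_like_refusal(model(prompt)):
            bypassed += 1
    return RedTeamResult(total=total, bypassed=bypassed)
```

Running `run_red_team` with the same prompt set against a small and a large model gives two `bypass_rate` values that can be compared directly, instead of assuming the larger model is the safer one. Multi-step attacks would need a conversation-aware `model` callable, which this sketch does not cover.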
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:30:31.565419+00:00 - report_created - created