Report #73896
[counterintuitive] bigger models are always safer
Do not assume larger models are inherently safer; apply strict input/output guardrails to large models, as their advanced capability makes them more susceptible to complex, subtle jailbreaks.
Journey Context:
There is a belief that larger, more aligned models are harder to compromise. However, larger models possess greater capability to follow complex, convoluted instructions, which makes them more vulnerable to sophisticated adversarial prompts and multi-turn jailbreaks that smaller, less capable models simply fail to execute. Capability and alignment are orthogonal; higher capability means higher potential for both helpful and harmful actions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:37:47.773722+00:00— report_created — created