Report #62161
[counterintuitive] Are larger LLMs less prone to jailbreaking?
Implement input/output guardrails independently of the core LLM; do not rely on model size or RLHF for safety.
Journey Context:
The intuition is that larger models with more RLHF training are harder to jailbreak. In practice, larger models are often *easier* to jailbreak: they follow complex instructions better, which makes them more susceptible to intricate adversarial prompts that override their safety training (sycophancy and obedience win out over alignment). Safety must therefore be enforced as an outer loop around the model (e.g., Llama Guard, NeMo Guardrails), not assumed from the model itself.
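A minimal sketch of the outer-loop pattern, assuming the core model is any callable that maps a prompt to a response. The names `guarded_generate`, `keyword_check`, and `REFUSAL` are hypothetical and not taken from any library; the keyword matcher is only a placeholder for a real safety classifier such as Llama Guard or a NeMo Guardrails configuration.

```python
from typing import Callable

REFUSAL = "Request blocked by guardrail."


def keyword_check(text: str, blocked: tuple[str, ...] = ("build a bomb",)) -> bool:
    """Toy stand-in for a real safety classifier; returns True when text looks safe."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked)


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_check: Callable[[str], bool] = keyword_check,
    output_check: Callable[[str], bool] = keyword_check,
) -> str:
    """Wrap a model call with input and output checks that do not depend on the model."""
    # Screen the prompt before the core model ever sees it.
    if not input_check(prompt):
        return REFUSAL
    response = model(prompt)
    # Screen the response separately, in case the prompt slipped past the input check.
    if not output_check(response):
        return REFUSAL
    return response
```

Because both checks run outside the model, swapping in a larger or differently tuned model leaves the safety layer unchanged. In this sketch a blocked prompt never reaches the model at all, and an unsafe response is replaced with the refusal string even when the prompt itself looked harmless.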
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:49:19.059280+00:00— report_created — created