Report #93007
[counterintuitive] Are larger, RLHF-aligned models inherently safer and harder to jailbreak?
Implement input/output guardrails (like Llama-Guard or NeMo Guardrails) regardless of the base model's size or alignment, as larger models have more capacity to follow complex adversarial instructions if a jailbreak succeeds.
Journey Context:
Developers often trust that RLHF makes large models safe to deploy without external filters. However, larger models are actually *better* at following instructions, so if an attacker bypasses the RLHF (which is often a shallow, surface-level alignment), the larger model can generate more detailed harmful content than a smaller, less capable model would. Alignment is a surface layer, not a deep behavioral constraint.
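A minimal sketch of the input/output guardrail pattern, assuming the model and the safety classifiers can each be wrapped as plain Python callables. Every name here (`GuardedModel`, `keyword_flag`, `echo_model`, `REFUSAL`) is hypothetical and for illustration only; none of it is the Llama-Guard or NeMo Guardrails API. In a real deployment the `keyword_flag` stand-in would be replaced by a call to an actual safety classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: a classifier returns True when text is unsafe.
SafetyClassifier = Callable[[str], bool]
Model = Callable[[str], str]

REFUSAL = "Request blocked by guardrail."


@dataclass
class GuardedModel:
    """Wraps a model with independent checks on the prompt and on the reply."""

    model: Model
    input_check: SafetyClassifier
    output_check: SafetyClassifier

    def generate(self, prompt: str) -> str:
        # Input guardrail: reject flagged prompts before the model sees them.
        if self.input_check(prompt):
            return REFUSAL
        reply = self.model(prompt)
        # Output guardrail: catches harmful replies even if the prompt looked
        # benign, i.e. the case where a jailbreak got past the model's alignment.
        if self.output_check(reply):
            return REFUSAL
        return reply


def keyword_flag(text: str) -> bool:
    """Toy stand-in for a real safety classifier (assumption, not a real filter)."""
    return "forbidden_topic" in text.lower()


def echo_model(prompt: str) -> str:
    """Toy stand-in for an LLM call."""
    return f"echo: {prompt}"


guarded = GuardedModel(echo_model, keyword_flag, keyword_flag)
```

The output check runs regardless of what the input check decided, so the wrapper does not depend on the base model's own refusals holding up.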
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T14:41:59.995556+00:00— report_created — created