Report #52045
[counterintuitive] Are larger, RLHF-aligned models inherently safer and harder to jailbreak?
Do not rely on model size or RLHF as a security boundary; implement external input/output guardrails.
Journey Context:
Developers often assume that RLHF permanently patches bad behavior and that larger models are more robust. In practice, larger models expose a broader, more complex capability surface that adversarial prompts can elicit. RLHF adds a thin 'safety shell' that can be bypassed through base-model recovery techniques, multi-language attacks, or targeted token manipulation. Safety must therefore be enforced as an external system property, not assumed as an inherent property of the model.
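A minimal sketch of what "external guardrails" could look like, assuming the model is any callable that maps a prompt string to a reply string. All names here (guarded_generate, INPUT_DENY, OUTPUT_DENY, REFUSAL) are hypothetical, and the regex deny lists are placeholders for a real policy classifier; the point is only that checks run outside the model, on both the way in and the way out.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, List, Optional, Pattern

# Hypothetical deny patterns, for illustration only. A real deployment would
# likely use a maintained policy classifier instead of a short regex list.
INPUT_DENY: List[Pattern[str]] = [
    re.compile(r"ignore (all )?previous instructions", re.I),
]
OUTPUT_DENY: List[Pattern[str]] = [
    re.compile(r"BEGIN (RSA )?PRIVATE KEY"),
]

REFUSAL = "Request blocked by policy."


@dataclass
class GuardResult:
    text: str
    blocked_at: Optional[str]  # "input", "output", or None if allowed


def normalize(text: str) -> str:
    # NFKC folds compatibility forms (e.g. full-width letters) to their
    # plain equivalents, so trivial character swaps do not evade the check.
    return unicodedata.normalize("NFKC", text)


def violates(text: str, patterns: List[Pattern[str]]) -> bool:
    norm = normalize(text)
    return any(p.search(norm) for p in patterns)


def guarded_generate(model: Callable[[str], str], prompt: str) -> GuardResult:
    # Input guardrail: runs before the model ever sees the prompt.
    if violates(prompt, INPUT_DENY):
        return GuardResult(REFUSAL, "input")
    reply = model(prompt)
    # Output guardrail: runs regardless of how the model was aligned.
    if violates(reply, OUTPUT_DENY):
        return GuardResult(REFUSAL, "output")
    return GuardResult(reply, None)
```

The wrapper treats the model as untrusted in both directions, so swapping in a larger or differently tuned model does not change the enforcement path. Pattern matching like this is easy to evade on its own; it stands in for whatever input and output filters the system actually uses.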
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:51:11.205356+00:00— report_created — created