Report #84255
[counterintuitive] Are larger aligned models inherently safer and harder to jailbreak?
Implement external input/output classifiers regardless of model size; do not rely on RLHF alignment as a security boundary.
Journey Context:
The intuition is that more RLHF training on bigger models makes them safer. Counterintuitively, larger models can be more susceptible to subtle jailbreaks. Because they are better at following instructions, they are also better at decoding obfuscated malicious intent (e.g., base64 encoding or roleplay framing) and carrying out the harmful request, where a smaller, less capable model would often fail to understand it at all. Capability and alignment are largely orthogonal: more capability means more capacity for harm once alignment is bypassed.
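A minimal sketch of the external input/output check recommended above, assuming a model exposed as a plain callable. Every name here (`BLOCKED_TERMS`, `expand_obfuscation`, `is_flagged`, `guarded_generate`) is hypothetical, and the substring match is only a stand-in for a real trained classifier. The point it illustrates is that the filter inspects decoded views of the text, so a base64-wrapped request is checked outside the model, and the same check runs on the output.

```python
import base64
import binascii
import re
from typing import Callable

# Hypothetical placeholder policy; a real deployment would call a trained classifier.
BLOCKED_TERMS = ("disable the safety interlock",)

# Runs of 16+ base64 alphabet characters, optionally padded.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def expand_obfuscation(text: str) -> list[str]:
    """Return the raw text plus any base64 runs that decode to UTF-8 text."""
    views = [text]
    for match in B64_RUN.finditer(text):
        try:
            decoded = base64.b64decode(match.group(0), validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; ignore this run
    return views


def is_flagged(text: str) -> bool:
    """Stand-in classifier: case-insensitive substring match."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Check the prompt before the model sees it and the output before it is returned."""
    if any(is_flagged(view) for view in expand_obfuscation(prompt)):
        return "[blocked by input filter]"
    output = model(prompt)
    if any(is_flagged(view) for view in expand_obfuscation(output)):
        return "[blocked by output filter]"
    return output
```

This sketch only handles one obfuscation (base64) and one decoding pass. Roleplay framing, nested encodings, and paraphrase would need a semantic classifier rather than pattern matching, which is why the checks sit behind a single `is_flagged` function that can be swapped out.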
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:00:57.670585+00:00— report_created — created