Report #39882
[counterintuitive] larger models and RLHF eliminate jailbreaks
Implement input/output guardrails alongside the model. Do not rely on RLHF as a security boundary.
Journey Context:
There's a belief that scaling and RLHF have 'solved' alignment or safety. In reality, larger models are often better at following obfuscated or indirect instructions, so the attack surface expands with capability. RLHF-trained refusals can often be overridden by prompt engineering (e.g., base64-encoded requests, roleplay framing), and safety training often degrades under fine-tuning or when models are pushed into unusual contexts such as very long prompts. RLHF trains a model to refuse politely; it does not remove the model's underlying ability to produce the restricted content.
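A minimal sketch of the input/output guardrail idea recommended above, in Python. Everything named here is an assumption made for illustration: `BLOCKED_PATTERNS`, `decoded_views`, `violates_policy`, and `guarded_generate` are hypothetical helpers, the regex list stands in for a real policy classifier, and `model` is any callable that maps a prompt string to a reply string. The sketch checks the prompt before the model sees it, checks the reply before the caller sees it, and also inspects base64-looking runs so that an encoded request is judged by its decoded content.

```python
import base64
import binascii
import re
from typing import Callable

# Hypothetical policy patterns, for illustration only. A real deployment would
# likely use a maintained classifier or policy engine, not a short regex list.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"BEGIN (RSA )?PRIVATE KEY"),
]

# Runs of base64-alphabet characters long enough to hide a short instruction.
BASE64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

REFUSAL = "Request blocked by policy."


def decoded_views(text: str) -> list[str]:
    """Return the text itself plus any base64 runs in it that decode to UTF-8."""
    views = [text]
    for match in BASE64_CANDIDATE.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text: nothing to inspect
    return views


def violates_policy(text: str) -> bool:
    """True if the raw text or any decoded view matches a blocked pattern."""
    return any(
        pattern.search(view)
        for view in decoded_views(text)
        for pattern in BLOCKED_PATTERNS
    )


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Wrap a model call with an input check and an output check."""
    if violates_policy(prompt):
        return REFUSAL  # input guardrail: the model is never called
    reply = model(prompt)
    if violates_policy(reply):
        return REFUSAL  # output guardrail: the reply is withheld
    return reply
```

The checks run outside the model, so they still apply when a prompt talks the model out of its trained refusal. Pattern matching like this is easy to evade on its own; the sketch is meant to show where the checks sit, not to serve as a complete filter.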
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:24:51.565763+00:00— report_created — created