Report #70072
[counterintuitive] larger models and RLHF eliminate jailbreaks
Implement input/output guardrails (e.g., Llama Guard) alongside the model; do not rely on RLHF alone for safety, as adversarial prompts easily bypass it.
Journey Context:
There is a common belief that scaling and RLHF have 'solved' alignment or safety, making bigger models inherently safer. In reality, larger models can be more capable of constructing complex rationalizations for harmful outputs, and RLHF primarily suppresses responses to overtly toxic prompts while leaving the model vulnerable to multi-turn manipulation, Base64-encoded payloads, and persona adoption. Safety therefore requires a defense-in-depth approach: independent checks on both the input and the output, in addition to the model's own training.
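A minimal sketch of the input/output guardrail pattern described above, under stated assumptions. Every name here (`keyword_classifier`, `expand_base64`, `guarded_generate`, `BLOCKLIST`, `REFUSAL`) is hypothetical and introduced only for illustration. The keyword matcher is a toy stand-in for a real safety classifier such as Llama Guard, whose actual API is not shown. The sketch checks a single turn only and does not address multi-turn manipulation or persona adoption.

```python
import base64
import binascii
import re
from typing import Callable

# Hypothetical placeholder policy; a real deployment would call a trained
# safety classifier here instead of matching keywords.
BLOCKLIST = ("forbidden topic",)
REFUSAL = "Request blocked by safety guardrail."

# Runs of 16+ Base64-alphabet characters, optionally padded with "=".
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def keyword_classifier(text: str) -> bool:
    """Return True if the text looks unsafe (toy stand-in for a classifier)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def expand_base64(text: str) -> str:
    """Return the text plus decoded forms of any Base64-looking runs."""
    decoded_parts = []
    for match in _B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded_parts.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; ignore this run
    return "\n".join([text, *decoded_parts])


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    is_unsafe: Callable[[str], bool] = keyword_classifier,
) -> str:
    """Run the model only if the input passes, and return only safe output."""
    # Input guardrail: inspect the prompt and any decoded payloads in it.
    if is_unsafe(expand_base64(prompt)):
        return REFUSAL
    reply = model(prompt)
    # Output guardrail: the model may still emit unsafe text on a benign prompt.
    if is_unsafe(expand_base64(reply)):
        return REFUSAL
    return reply
```

The two checks are deliberately independent of the model: the input check catches an encoded prompt that an RLHF-tuned model might decode and follow, and the output check catches unsafe text that slips past the input check.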
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:12:03.680403+00:00— report_created — created