Report #95391
[counterintuitive] Are larger, RLHF-aligned models inherently safer against adversarial attacks?
Implement input/output guardrails independently of the core model size; do not assume RLHF prevents jailbreaks or prompt injections.
Journey Context:
It is commonly assumed that scaling and RLHF together eliminate safety risks. In practice, larger models expose a broader and more complex capability surface for adversarial attacks to exploit. RLHF primarily suppresses overtly harmful generations; it can still fail against sophisticated prompt injections, multi-turn manipulation, and encoded inputs. Alignment training shapes a model's default behavior but can be bypassed, so it should not be treated as a security boundary.
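As a rough illustration of guardrails that sit outside the model, the sketch below wraps any model callable with an input check and an output check. Everything in it is an assumption made for illustration: the names `guarded_call`, `check_input`, `check_output`, the deny patterns, and the `sk-` secret format are hypothetical and not taken from any library. Regex matching is itself easy to evade, so a real deployment would likely swap in a maintained classifier or policy engine; the point is only that the checks run regardless of model size or alignment training.

```python
import base64
import re
from typing import Callable, List

REFUSAL = "Request blocked by guardrail."

# Hypothetical deny patterns, for illustration only.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
# Hypothetical secret format used to demonstrate output filtering.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{16,}")
# Long runs of base64 alphabet characters are treated as possibly encoded text.
BASE64_TOKEN = re.compile(r"[A-Za-z0-9+/=]{16,}")


def decoded_views(text: str) -> List[str]:
    """Return the text plus any base64-decodable tokens found inside it."""
    views = [text]
    for token in BASE64_TOKEN.findall(text):
        try:
            views.append(base64.b64decode(token, validate=True).decode("utf-8"))
        except ValueError:
            # binascii.Error and UnicodeDecodeError both subclass ValueError.
            continue
    return views


def check_input(prompt: str) -> bool:
    """Return True if no deny pattern matches the prompt or its decoded views."""
    return not any(
        pattern.search(view)
        for view in decoded_views(prompt)
        for pattern in INJECTION_PATTERNS
    )


def check_output(reply: str) -> bool:
    """Return True if the reply does not contain a secret-like string."""
    return SECRET_PATTERN.search(reply) is None


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Run the model only if the input passes, and filter what it returns."""
    if not check_input(prompt):
        return REFUSAL
    reply = model(prompt)
    if not check_output(reply):
        return REFUSAL
    return reply
```

Here `model` can be any function that maps a prompt string to a reply string, so the same wrapper applies unchanged to a small or a large model. Multi-turn manipulation is not covered by this sketch; handling it would require running the checks over conversation history rather than a single prompt.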
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:41:32.513497+00:00— report_created — created