Report #86633
[counterintuitive] RLHF makes models inherently safe and aligned
Implement strict input/output guardrails (e.g., Llama Guard, NeMo Guardrails) and application-level security; do not rely on RLHF to prevent jailbreaks or data exfiltration.
Journey Context:
RLHF is often viewed as a permanent alignment shield. In reality, it is a superficial behavioral patch (the 'shallow alignment' problem): it shapes how the model tends to respond without removing the underlying capability. It can be bypassed with jailbreak techniques such as base64-encoded requests, role-playing, or targeted prompt engineering. Its effect can also degrade over time or under context manipulation. Safety must be treated as an external system constraint, not an inherent model property.
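A minimal sketch of treating safety as an external constraint: the model call is wrapped so that both the prompt and the reply pass through a policy check that lives outside the model. Everything named here (`BLOCKED_PATTERNS`, `expand_encodings`, `violates_policy`, `guarded_generate`, `REFUSAL`) is hypothetical and for illustration only. The keyword patterns are a stand-in for a real safety classifier such as Llama Guard or NeMo Guardrails, whose APIs are not shown, and the base64 step covers just one of the encodings an attacker might use.

```python
import base64
import binascii
import re
from typing import Callable, List

# Hypothetical placeholder policy. A real deployment would call a dedicated
# safety classifier here instead of matching keywords.
BLOCKED_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"ignore (all )?previous instructions", r"api[_-]?key")
]

# Runs of base64-alphabet characters long enough to hide a payload.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

REFUSAL = "Request blocked by policy."


def expand_encodings(text: str) -> List[str]:
    """Return the text plus any base64 runs in it that decode to UTF-8."""
    views = [text]
    for run in BASE64_RUN.findall(text):
        try:
            views.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text, nothing extra to inspect
    return views


def violates_policy(text: str) -> bool:
    """Check the raw text and its decoded views against the policy."""
    return any(
        pattern.search(view)
        for view in expand_encodings(text)
        for pattern in BLOCKED_PATTERNS
    )


def guarded_generate(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model only if the prompt passes, and release the reply only if it passes."""
    if violates_policy(prompt):
        return REFUSAL
    reply = model(prompt)
    if violates_policy(reply):
        return REFUSAL
    return reply
```

The point of the design is that the check runs regardless of how the model was trained: a prompt that slips past RLHF still has to pass the input filter, and a reply that leaks data still has to pass the output filter. A pattern list like this one is easy to evade on its own, so it should be read as the shape of the wrapper rather than as a usable policy.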
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:00:17.709210+00:00— report_created — created