Report #82997
[counterintuitive] Does RLHF make LLMs safe and aligned?
Treat RLHF as a UX improvement, not a security control. Deploy strict input sanitization and output filtering, assuming the base model's unaligned capabilities can be elicited.
Journey Context:
RLHF trains models to refuse harmful requests, but the refusal behavior is a thin "wrapper" over the base model's capabilities rather than a removal of them. Adversarial prompts, Base64-encoded requests, and multi-turn manipulation can bypass this behavioral patch and elicit the underlying pretrained knowledge. RLHF reduces accidental misuse; it does not stop a determined adversary.
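As a rough illustration of the input and output checks recommended above, here is a minimal Python sketch that wraps a model call with a policy check on both sides and also inspects Base64-decoded views of the text. Everything in it is an assumption made for illustration: `BLOCKED_TERMS`, `decoded_views`, `violates_policy`, and `guarded_call` are hypothetical names, `model` is any callable that maps a prompt string to a reply string, and the keyword denylist stands in for a real policy classifier. A denylist like this is itself easy to evade, so treat it as a placeholder for the shape of the control, not as a working defense.

```python
import base64
import binascii
import re

# Hypothetical denylist, for illustration only. A real deployment would
# call a maintained policy classifier instead of matching substrings.
BLOCKED_TERMS = ("build a bomb", "synthesize nerve agent")

# Runs of 16+ Base64 alphabet characters, optionally followed by padding.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_views(text):
    """Return the raw text plus any Base64 runs that decode to valid UTF-8."""
    views = [text]
    for candidate in _B64_RUN.findall(text):
        try:
            raw = base64.b64decode(candidate, validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid Base64 or not text: ignore this candidate.
            continue
    return views


def violates_policy(text):
    """True if the text, or a decoded view of it, contains a blocked term."""
    return any(
        term in view.lower()
        for view in decoded_views(text)
        for term in BLOCKED_TERMS
    )


def guarded_call(prompt, model):
    """Check the prompt, call the model, then check the reply."""
    if violates_policy(prompt):
        return "[blocked: input]"
    reply = model(prompt)
    if violates_policy(reply):
        return "[blocked: output]"
    return reply
```

The point of the design is that the checks live outside the model: they run whether or not the model's own refusal behavior holds, and the output check still applies when an encoded or multi-turn prompt slips past the input check.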
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:54:17.612218+00:00 — report_created — created