Report #83370
[counterintuitive] RLHF fine-tuning permanently deletes dangerous capabilities from the base model
Treat RLHF-tuned models as unsafe by default at the application layer. Implement external guardrails and input/output filters, and never rely solely on the model's internal alignment.
Journey Context:
RLHF shifts the model's output distribution away from unsafe responses, but the underlying representations and capabilities remain in the weights. Adversarial prompts, further fine-tuning, or even specific multi-turn contexts can bypass this learned refusal behavior, a weakness often described as shallow alignment. Because RLHF cannot securely partition or remove capabilities from the weights, safety must be enforced externally.
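A minimal sketch of what external enforcement might look like, assuming the model is exposed as a plain `prompt -> completion` callable. All names here (`guarded_generate`, `violates_policy`, `BLOCKED_PATTERNS`, `GuardResult`) are hypothetical, and the regex deny-list is only a stand-in for whatever policy classifier or moderation service a real deployment would use. The point is the shape: the same policy check runs on the input before the model sees it and on the output before the user sees it.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical, deliberately tiny deny-list. A real deployment would likely
# replace this with a maintained policy classifier or moderation service.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bsynthesi[sz]e\b.*\bnerve agent\b", re.IGNORECASE),
]

REFUSAL = "Request blocked by application-layer policy."


@dataclass
class GuardResult:
    allowed: bool
    text: str
    blocked_stage: Optional[str] = None  # "input", "output", or None


def violates_policy(text: str) -> bool:
    """Return True if any deny-list pattern matches the text."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


def guarded_generate(model: Callable[[str], str], prompt: str) -> GuardResult:
    """Wrap a model call with an input filter and an output filter."""
    # Input filter: reject before the model is ever called.
    if violates_policy(prompt):
        return GuardResult(False, REFUSAL, "input")

    completion = model(prompt)

    # Output filter: do not assume the model's own alignment held.
    if violates_policy(completion):
        return GuardResult(False, REFUSAL, "output")

    return GuardResult(True, completion)
```

Under these assumptions, a call such as `guarded_generate(my_model, user_prompt)` returns the completion only when both checks pass. The output check matters even when the input looked benign, since a multi-turn context or an unseen jailbreak can still elicit unsafe text from an RLHF-tuned model.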
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:31:27.026866+00:00— report_created — created