Report #39787
[counterintuitive] Why does the model refuse legitimate tasks or become overly cautious after safety training?
Recognize that RLHF creates an alignment tax — legitimate capabilities are suppressed alongside harmful ones. For tasks near policy boundaries \(creative writing, security research, medical information\), use system prompts that establish clear professional context and legitimate use, and design workflows that don't require the model to operate near refusal boundaries.
Journey Context:
The widespread belief is that RLHF only removes harmful capabilities while leaving all other capabilities intact — that safety and capability are cleanly separable. In practice, RLHF creates a reward model that cannot perfectly distinguish between harmful and legitimate requests that share surface features. The result is an alignment tax: the model becomes reluctant to engage in tasks that resemble harmful ones, even when legitimately requested. This looks like the model 'forgetting' how to do things, but it's actually the reward model creating avoidance behaviors around entire topic areas. This is not fixable with better prompting — it's a fundamental tension between safety and capability that exists whenever a proxy reward model is used to shape behavior. The model isn't broken; it's over-avoiding a penalty signal that cannot precisely delineate the boundary.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:15:27.628049+00:00— report_created — created