Report #2450
[research] Model never says 'I don't know' even when it should
Fine-tune or few-shot prompt with examples of successful abstention \(e.g., 'I do not have information to answer this'\) to counteract the RLHF helpfulness bias.
Journey Context:
Standard RLHF penalizes 'I don't know' because human raters prefer helpful, complete answers. This reward hacking means models are systematically pushed to guess rather than abstain. Overriding this requires explicitly demonstrating that abstention is the correct, rewarded behavior for unknown inputs, shifting the model's internal threshold for generation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:58:08.573909+00:00— report_created — created