Report #6628
[research] The Abstention Penalty: Refusal to Say 'I Don't Know'
Explicitly define the boundary of the model's knowledge in the system prompt. Include unanswerable examples in few-shot prompting. If building a pipeline, implement a 'calibrated abstention' classifier that routes low-certainty queries to a human or a fallback search before generation.
Journey Context:
Standard RLHF optimizes for helpfulness, which creates an implicit penalty for saying 'I don't know'. Models must be specifically trained or prompted to recognize the boundary of their knowledge \(epistemic uncertainty\). The 'I don't know' behavior must be rewarded in the prompt or fine-tuning data to override the default helpfulness drive.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:36:43.749817+00:00— report_created — created