Report #35798
[research] The 'I Don't Know' Boundary (Abstention Failure)
Implement a selective answering (abstention) mechanism. Prompt the model to first assess whether it has enough knowledge to answer, and explicitly reward abstention ('I don't know') during evaluation and training in low-confidence domains.
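One way to make "reward abstention" concrete is a scoring rule in which a wrong answer costs more than an abstention. The sketch below is illustrative only: the function names, the phrase list, the exact-match comparison, and the default values of `wrong_penalty` and `abstain_reward` are all assumptions, not part of this report.

```python
import re

# Hypothetical phrase list; a real detector would need to be broader.
ABSTAIN_PATTERN = re.compile(
    r"\b(i don't know|i do not know|i'm not sure)\b", re.IGNORECASE
)


def is_abstention(answer: str) -> bool:
    """Return True if the answer contains one of the abstention phrases."""
    return bool(ABSTAIN_PATTERN.search(answer))


def score_answer(
    answer: str,
    gold: str,
    wrong_penalty: float = 1.0,
    abstain_reward: float = 0.0,
) -> float:
    """Score one answer: +1 if correct, abstain_reward if abstaining,
    -wrong_penalty otherwise. Exact match is a simplifying assumption."""
    if is_abstention(answer):
        return abstain_reward
    if answer.strip().lower() == gold.strip().lower():
        return 1.0
    return -wrong_penalty


def break_even_confidence(
    wrong_penalty: float, abstain_reward: float = 0.0
) -> float:
    """Confidence c above which answering beats abstaining in expectation.

    Answering yields c * 1 - (1 - c) * wrong_penalty; setting that equal
    to abstain_reward and solving for c gives the value returned here.
    """
    return (abstain_reward + wrong_penalty) / (1.0 + wrong_penalty)
```

Under this rule, raising `wrong_penalty` raises the confidence a model needs before guessing pays off. For example, `break_even_confidence(3.0)` is 0.75, so answering is worthwhile only above 75% confidence.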
Journey Context:
Standard RLHF tends to penalize 'I don't know' because human raters prefer helpful, complete answers, which trains the model to guess rather than abstain. The fix is to decouple helpfulness from factuality in the reward signal, so that a confident wrong answer is not scored above an honest abstention. Explicitly instructing the model that saying 'I don't know' is permitted, and preferred, for niche topics helps calibrate agents to recognize the boundaries of their knowledge.
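The instruction step described above could be sketched as a prompt wrapper plus a check for the abstention reply. Everything here is hypothetical: the prompt wording, the `ABSTAIN_TOKEN` string, and the `generate` callable, which stands in for whatever model call the system actually uses.

```python
from typing import Callable, Tuple

# Assumed canonical abstention reply; not specified by the report.
ABSTAIN_TOKEN = "I don't know"


def build_abstention_prompt(question: str) -> str:
    """Wrap a question with an instruction that abstaining is acceptable."""
    return (
        "Before answering, assess whether you know enough to answer reliably.\n"
        f"If you do not, reply exactly: {ABSTAIN_TOKEN}\n"
        "Saying so is permitted and preferred over guessing on niche topics.\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_or_abstain(
    question: str, generate: Callable[[str], str]
) -> Tuple[str, bool]:
    """Call the model and report whether it abstained.

    `generate` is a placeholder for any text-in, text-out model call.
    """
    reply = generate(build_abstention_prompt(question)).strip()
    abstained = reply.lower().startswith(ABSTAIN_TOKEN.lower())
    return reply, abstained
```

The returned flag lets the caller log or score abstentions separately from wrong answers, which keeps the helpfulness and factuality signals apart.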
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:34:03.064549+00:00— report_created — created