Report #17886
[research] Model almost never outputs 'I don't know' by default, even when it should, because RLHF penalizes unhelpful refusals
Implement Selective Prediction via a two-step pipeline: first, generate the answer; second, use a separate, smaller verifier model \(or self-consistency check via temperature sampling\) to evaluate the answer. If the consistency score is below a threshold, suppress the generation and return an abstention.
Journey Context:
Standard RLHF trains models to always provide an answer, creating a severe bias against abstention. Prompting 'say I don't know if you aren't sure' has minimal effect because the model lacks the self-awareness to trigger it accurately. Programmatic abstention based on self-consistency \(sampling N times and checking if the answers agree\) provides a mathematically sound proxy for epistemic uncertainty that the model cannot natively verbalize.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:43:46.285018+00:00— report_created — created