Report #8276
[research] LLM answers obscure questions incorrectly instead of abstaining, because it was never taught when to say 'I don't know'
Implement selective prediction by thresholding a logprob-based confidence score for the generation (for example, the length-normalized mean token log-probability). If the score falls below a threshold validated on held-out labeled data, route to a default 'Unknown' or 'Escalate' action instead of returning the answer.
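A minimal sketch of the routing step, assuming the serving stack returns one log-probability per generated token. The names `sequence_confidence` and `select_or_abstain`, and the choice of the geometric-mean token probability as the confidence score, are illustrative assumptions and not part of this report.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Length-normalized confidence: exp of the mean token log-probability.

    Assumption: this geometric-mean score stands in for "the probability of
    the generation"; other scores (min token prob, full sequence prob) are
    equally plausible choices.
    """
    if not token_logprobs:
        return 0.0  # no tokens means no evidence, so treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def select_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float,
    fallback: str = "Unknown",
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return fallback  # hypothetical default action; could be 'Escalate'


# Usage with made-up log-probabilities
confident = select_or_abstain("Paris", [-0.05, -0.10], threshold=0.8)
uncertain = select_or_abstain("Atlantis", [-1.2, -2.0], threshold=0.8)
```

Length normalization keeps long answers from being penalized simply for having more tokens; whether that is the right trade-off depends on the task.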
Journey Context:
Standard RLHF tends to reward answers that look helpful and to penalize refusals, which biases the model against saying 'I don't know'. Prompting alone ('say I don't know if you aren't sure') is unreliable because the model lacks the internal calibration to trigger it accurately. Programmatic thresholds on token probabilities are needed to enforce abstention reliably.
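One way the threshold might be validated is sketched below, assuming a held-out set where each answer has a confidence score and a correctness label. The function `pick_threshold` and the "target accuracy on answered questions" criterion are assumptions for illustration; the report does not specify a validation procedure.

```python
from typing import Optional, Sequence


def pick_threshold(
    confidences: Sequence[float],
    correct: Sequence[bool],
    target_accuracy: float,
) -> Optional[float]:
    """Lowest threshold whose answered subset meets the target accuracy.

    Answering everything with confidence >= threshold maximizes coverage
    subject to the accuracy constraint. Returns None if no threshold works.
    """
    # Walk from most to least confident, tracking accuracy of the prefix.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best: Optional[float] = None
    n_correct = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        n_correct += ok
        # Only evaluate where the confidence value changes, so tied
        # examples are always answered or rejected together.
        is_boundary = i == len(pairs) or pairs[i][0] != conf
        if is_boundary and n_correct / i >= target_accuracy:
            best = conf
    return best


# Usage with made-up validation data
threshold = pick_threshold(
    confidences=[0.95, 0.90, 0.70, 0.40],
    correct=[True, True, False, False],
    target_accuracy=0.9,
)
```

A threshold chosen this way only holds for inputs that resemble the validation set, so it is worth re-checking when the question distribution or the model changes.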
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:09:23.434784+00:00 — report_created — created