Report #54701
[research] LLM guesses an answer with high confidence when it lacks sufficient knowledge, instead of expressing calibrated uncertainty or abstaining
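Use thresholded logprobs (abstain when the answer's token log-probabilities fall below a cutoff) or explicit "I don't know" (IDK) prompting. For the prompting approach, instruct the model: 'If you are not certain based on the provided context, respond with "I do not have enough information."'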
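A minimal sketch of both workarounds is below. It assumes you already have the generated answer and its per-token natural-log probabilities from whatever inference API you use. The confidence score (geometric-mean token probability, i.e. `exp` of the mean logprob), the `0.8` threshold, and the helper names `build_prompt`, `sequence_confidence`, and `answer_or_abstain` are all hypothetical choices for illustration. They are not a specific library's API, and the threshold would need tuning on held-out data.

```python
import math
from typing import Sequence

IDK_RESPONSE = "I do not have enough information."

# Hypothetical instruction text, following the wording suggested in the report.
IDK_INSTRUCTION = (
    "If you are not certain based on the provided context, "
    f'respond with "{IDK_RESPONSE}"'
)


def build_prompt(context: str, question: str) -> str:
    """Prepend the explicit 'I don't know' instruction to a context-grounded question."""
    return f"{IDK_INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability: exp(mean of natural-log probabilities)."""
    if not token_logprobs:
        return 0.0  # no tokens means no evidence, so treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(answer: str, token_logprobs: Sequence[float], threshold: float = 0.8) -> str:
    """Return the answer only if its confidence clears the (hypothetical) threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return IDK_RESPONSE


if __name__ == "__main__":
    print(answer_or_abstain("Paris", [-0.01, -0.02]))  # high confidence, answer kept
    print(answer_or_abstain("Lyon", [-1.2, -0.9]))     # low confidence, abstains
```

Averaging logprobs rather than summing them keeps long answers from being penalized simply for having more tokens; a minimum per-token probability is a stricter alternative.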
Journey Context:
Standard RLHF penalizes "I don't know" responses because raters score them as unhelpful, so models learn to always produce an answer, which leads to hallucinations. Explicitly rewarding abstention (selective prediction) on out-of-distribution or unknown data shifts the model toward answering only when its internal confidence exceeds a verifiable threshold.
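The incentive argument can be made concrete with a simplified, hypothetical scoring rule (not any particular RLHF reward model): +1 for a correct answer, `-wrong_penalty` for a wrong one, and `abstain_reward` for abstaining. If the model is correct with probability `p`, answering beats abstaining in expectation only when `p - (1 - p) * wrong_penalty > abstain_reward`, i.e. when `p > (abstain_reward + wrong_penalty) / (1 + wrong_penalty)`. The function names and default values are assumptions for illustration.

```python
from typing import Optional


def selective_reward(prediction: Optional[str], gold: str,
                     wrong_penalty: float = 1.0, abstain_reward: float = 0.0) -> float:
    """Score one response; prediction=None means the model abstained."""
    if prediction is None:
        return abstain_reward
    return 1.0 if prediction == gold else -wrong_penalty


def expected_answer_reward(p_correct: float, wrong_penalty: float) -> float:
    """Expected reward of answering when the answer is correct with probability p_correct."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def answer_threshold(wrong_penalty: float, abstain_reward: float = 0.0) -> float:
    """Confidence above which answering beats abstaining in expectation.

    Solves p - (1 - p) * c = a for p, giving p = (a + c) / (1 + c).
    """
    return (abstain_reward + wrong_penalty) / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # Wrong answers unpenalized and abstention scored as unhelpful: always answer.
    print(answer_threshold(wrong_penalty=0.0, abstain_reward=-0.5))  # negative threshold
    # Wrong answers cost 3x a correct answer: answer only above 75% confidence.
    print(answer_threshold(wrong_penalty=3.0))
```

With `wrong_penalty=0` and a negative `abstain_reward` (a rough stand-in for "I don't know" being rated unhelpful), the threshold drops below zero, so guessing is always optimal. Raising the penalty for wrong answers relative to abstention pushes the threshold up, which is the behavior shift the report describes.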
⚠ Workarounds are unverified: always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:18:46.814019+00:00 - report_created - created