Report #88669
[research] Overconfidence and failure to abstain on obscure questions
Calibrate a confidence threshold on token logprobs and map generations that fall below it to an explicit 'I don't know' response. Also prompt for abstention, e.g. 'Answer only if you are highly confident...'
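A minimal sketch of the logprob-to-abstention mapping, assuming the serving stack already returns one logprob per generated token. The names `sequence_confidence`, `answer_or_abstain`, `ABSTAIN_TEXT`, and the default threshold of 0.8 are hypothetical and chosen only for illustration. The score used here is the geometric-mean token probability, which is one common choice rather than the only one.

```python
import math
from typing import Sequence

# Hypothetical abstention string; any fixed refusal text would do.
ABSTAIN_TEXT = "I don't know."


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean logprob), the geometric-mean token probability."""
    if not token_logprobs:
        # No tokens means no evidence, so treat it as zero confidence.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.8,  # assumed value; tune it on held-out data
) -> str:
    """Keep the answer only if its confidence reaches the threshold."""
    if sequence_confidence(token_logprobs) < threshold:
        return ABSTAIN_TEXT
    return answer
```

Averaging in log space keeps long answers from being penalized purely for their length, which a raw product of token probabilities would do.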
Journey Context:
Standard RLHF tends to suppress 'I don't know' because it is penalized as unhelpful. The result is a bias toward answering, even when the answer is fabricated. Logprob calibration or fine-tuning on abstention examples is needed to recover the model's ability to express epistemic uncertainty.
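One way the calibration step could look, as a sketch only: given held-out questions with a confidence score and a correctness label for each, pick the lowest cutoff at which the answered subset still reaches a target accuracy. The function name `pick_threshold`, the `(confidence, was_correct)` input shape, and the 0.9 target are assumptions for illustration, not part of the report.

```python
from typing import Iterable, Optional, Tuple


def pick_threshold(
    scored: Iterable[Tuple[float, bool]],
    target_accuracy: float = 0.9,  # assumed target; set per use case
) -> Optional[float]:
    """Lowest cutoff whose answered set (confidence >= cutoff) meets the target.

    Returns None when no cutoff reaches the target accuracy.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (confidence, was_correct) in enumerate(ranked):
        correct += int(was_correct)
        # Evaluate only at the end of a run of equal scores, because a
        # cutoff at this value would admit every item in the run.
        end_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != confidence
        if end_of_tie and correct / (i + 1) >= target_accuracy:
            best = confidence
    return best
```

A lower cutoff answers more questions, so returning the lowest qualifying value maximizes coverage at the chosen accuracy; everything below it would be mapped to an abstention.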
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:24:59.599717+00:00 — report_created — created