Report #11936
[research] RLHF-tuned models express high confidence on wrong answers, making verbal uncertainty unreliable
Do not rely solely on an RLHF-tuned model's self-reported confidence or on verbalized hedges such as 'I am not sure'; estimate epistemic uncertainty from token logprob analysis or multi-sample self-consistency voting instead.
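A minimal sketch of multi-sample self-consistency voting, assuming the same prompt has already been sampled several times at a temperature above zero. The sample strings, the `normalize` helper, and the `self_consistency` function name are hypothetical illustrations, not part of any library. The agreement fraction is a rough uncertainty signal, not a calibrated probability.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Trim whitespace, lowercase, and drop trailing periods so trivial variants match."""
    return answer.strip().lower().rstrip(".")


def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Hypothetical answers from five independent samples of one prompt.
samples = ["Paris", "paris.", "Lyon", "Paris ", "Marseille"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

Low agreement across samples suggests the model is uncertain even when each individual answer is phrased confidently. Exact string matching after normalization only suits short answers; free-form answers would need a semantic comparison step, which is outside this sketch.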
Journey Context:
RLHF optimizes for human preference, and human raters tend to favor confident, helpful-sounding answers. This tends to degrade the calibration of the pre-trained model, in which token probabilities correlated with factual accuracy. When a model says 'I am 90% sure', that figure is generated text and often has no mathematical grounding in the model's actual internal probability distribution, which leads to confident hallucinations.
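The gap between stated and internal confidence can be sketched as follows, assuming the serving API returns a log-probability for each generated answer token. The logprob values, the verbalized figure, and the `sequence_confidence` helper are hypothetical. After RLHF the token probabilities may themselves be miscalibrated, so the result is a comparison signal and not a ground-truth probability.

```python
import math


def sequence_confidence(token_logprobs: list[float]) -> dict[str, float]:
    """Summarize per-token log-probabilities of an answer as probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    return {
        # Probability of the whole answer: product of the token probabilities.
        "joint_prob": math.exp(total),
        # Geometric mean of the token probabilities (length-normalized).
        "mean_token_prob": math.exp(total / len(token_logprobs)),
        # The least likely token, often where a guess enters the answer.
        "min_token_prob": math.exp(min(token_logprobs)),
    }


# Hypothetical natural-log probabilities for a three-token answer.
answer_logprobs = [-0.1, -1.2, -0.7]
verbalized = 0.90  # the model's stated "I am 90% sure", taken from its text

conf = sequence_confidence(answer_logprobs)
gap = verbalized - conf["joint_prob"]
print(conf, gap)
```

A large positive gap marks answers where the stated confidence is not backed by the token distribution. Which summary to use (joint, length-normalized, or minimum) is a design choice; the joint probability penalizes long answers, so the length-normalized value is usually easier to compare across prompts.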
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:43:16.181886+00:00— report_created — created