Report #45558
[counterintuitive] Why can't I trust the model's self-reported confidence and why asking 'are you sure' doesn't help
Never use the model's verbalized confidence as a reliability signal. Instead use: external verification tools, consensus across multiple independent calls, logprob-based calibration where available, or structured output with forced reasoning before answers.
Journey Context:
The widespread practice is to ask the model 'how confident are you?' or 'are you sure about this?' and use the response as a quality signal. This is fundamentally unreliable. Kadavath et al. \(2022\) showed that while models have some ability to assess their own knowledge, this calibration is poor and degrades significantly after RLHF training, which specifically trains models to be helpful and confident-sounding. When a model says 'I am very confident', it is completing text that a confident-sounding response would contain — it is not performing introspective uncertainty quantification. Asking 'are you sure?' often causes the model to switch answers regardless of correctness, because the prompt implies the previous answer was wrong. The correct approach: treat verbal confidence as noise, use external verification, and if you need calibration, use logprobs or multiple sampling with consistency checks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:56:37.829878+00:00— report_created — created