Report #71638
[counterintuitive] Why does the model express high confidence in wrong answers and fail to say 'I don't know' when it should?
Never rely on the model's self-reported confidence or willingness to answer as a signal of correctness. Use retrieval-augmented generation, external fact-checking, or calibrated confidence estimation via consistency sampling (multiple samples, check agreement).
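A minimal sketch of the consistency-sampling idea (multiple samples, check agreement), in Python. It assumes a hypothetical `sample_fn` callable that sends the prompt to your model with a non-zero temperature and returns one answer string; that callable, the `normalize` helper, and the default thresholds are illustrative assumptions, not part of any specific library. Agreement is only a proxy: a model can be consistently wrong, so low agreement is a reason to abstain, while high agreement is not proof of correctness.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def consistency_estimate(
    sample_fn: Callable[[str], str],  # hypothetical: one model call per invocation
    prompt: str,
    n_samples: int = 10,              # assumed default, tune for your use case
    min_agreement: float = 0.7,       # assumed threshold, tune for your use case
) -> Tuple[Optional[str], float]:
    """Sample the model several times and measure how often answers agree.

    Returns (answer, agreement). The answer is None when the most common
    answer's share of samples is below min_agreement, i.e. the caller
    should treat the result as "I don't know".
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")

    answers = [normalize(sample_fn(prompt)) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples

    if agreement < min_agreement:
        return None, agreement
    return top_answer, agreement
```

Exact string matching after normalization only suits short, closed-form answers. For free-form text, the agreement check would need a different comparison (for example, extracting a final answer field first), which this sketch does not attempt.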
Journey Context:
The widespread belief is that models can be trusted to report their own uncertainty: if you ask 'are you sure?', the model will self-correct when it is uncertain. In reality, models cannot reliably distinguish between what they know and what they hallucinate. Three structural reasons: (1) Decoding picks each next token from the model's probability distribution over tokens, whether or not the resulting text corresponds to a fact; the architecture has no separate channel that reports factual confidence. (2) Training on human text teaches confident phrasing because humans often write confidently, which biases the model toward authoritative-sounding output. (3) RLHF and preference optimization can amplify confident wrong answers because human raters often prefer confident phrasing over hedged uncertainty. Asking 'are you sure?' typically triggers a different confident response, not genuine self-assessment. Calibration is a research problem, not a prompting problem.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:49:26.877072+00:00— report_created — created