Report #44306
[counterintuitive] Why can't I trust the model's self-reported confidence or uncertainty assessments?
Never rely on a model's self-assessed confidence ('I am 95% sure', 'I know this topic well'). Verify externally instead: ground answers with retrieval-augmented generation, run outputs against test suites, or use a separate verification model that has access to different context. Treat confidence statements as generated text, not as metacognitive reports.
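As a minimal sketch of the test-suite option, the snippet below accepts a generated function only if it passes externally defined test cases, and ignores the confidence the model reported. Everything here is hypothetical and for illustration only: the `Candidate` type, the `passes_tests` and `select_verified` helpers, the example sources, and the confidence values. Executing generated code with `exec` is unsafe outside a sandbox.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    source: str               # code text produced by the model
    stated_confidence: float  # the model's self-reported confidence (0..1)


def passes_tests(source: str, func_name: str, cases: list) -> bool:
    """Run the candidate against (args, expected) pairs; any error counts as a failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # ASSUMPTION: real use would run this in a sandbox
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


def select_verified(candidates: list, func_name: str, cases: list) -> Optional[Candidate]:
    """Return the first candidate that passes the tests; stated confidence is never consulted."""
    for candidate in candidates:
        if passes_tests(candidate.source, func_name, cases):
            return candidate
    return None


# Hypothetical model outputs: the more "confident" one is wrong for even-length input.
WRONG = """
def median(xs):
    return sorted(xs)[len(xs) // 2]
"""

RIGHT = """
def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
"""

cases = [(([1, 3, 2],), 2), (([1, 2, 3, 4],), 2.5)]
candidates = [Candidate(WRONG, 0.95), Candidate(RIGHT, 0.60)]
chosen = select_verified(candidates, "median", cases)
```

In this example the candidate reported at 0.95 fails the even-length case and is rejected, while the one reported at 0.60 is accepted. The acceptance decision depends only on the external check.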
Journey Context:
Developers often prompt models to express uncertainty ('say if you're not sure', 'rate your confidence 1-10'). But LLMs do not reliably introspect on their own knowledge boundaries: they generate confidence statements the same way they generate any other text, by pattern-matching on training data. Kadavath et al. reported that models can be reasonably well calibrated on multiple-choice and true/false questions when these are posed in a suitable format, but that calibration is weaker on new tasks. Free-form confidence claims about a model's own specific outputs are less reliable still. Models are often most confident on wrong answers, because a fluent wrong answer resembles a correct one in the training distribution. Scale does not reliably fix this: larger models can be worse calibrated in their verbalized confidence because they produce more fluent, confident-sounding text for correct and incorrect answers alike. The verbalized uncertainty you get is a prediction of what a confident or uncertain response looks like, not an assessment of the model's knowledge state. The counterintuitive insight: asking 'are you sure?' often increases confidence in a wrong answer, because the model tends to generate a justification for its initial response instead of re-examining it.
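One way to check this on your own workload is to log the confidence a model states alongside whether an external check found the answer correct, then measure the gap. The sketch below computes expected calibration error (ECE) with equal-width bins. The function name, the bin count, and the sample data are assumptions for illustration, not measurements from any real model.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin.

    confidences: stated confidences in [0, 1]
    correct:     booleans from an EXTERNAL check (tests, retrieval, human review)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one observation")

    # Assign each observation to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical log: the model says 0.9 every time but is right only once in four.
stated = [0.9, 0.9, 0.9, 0.9]
verified = [True, False, False, False]
gap = expected_calibration_error(stated, verified)
```

A value near zero means stated confidence tracks verified accuracy on that sample. A large value, as in this made-up log, means the stated numbers should not drive any decision.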
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:50:14.903902+00:00— report_created — created