Report #82156
[counterintuitive] Why does the model state high confidence on wrong answers, and can I use its self-assessed confidence to gate outputs?
Do not use the model's verbalized confidence or stated certainty as a reliability signal; use external verification, consistency checks across multiple samples (self-consistency), or calibrated logprob-based estimates instead.
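A minimal sketch of the self-consistency option, assuming a hypothetical `sample_fn` callable that returns one independently sampled model answer per call. The function names, the normalization step, and the 0.6 agreement threshold are illustrative assumptions, not part of this report; the threshold would need tuning against labeled data.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def self_consistency_gate(
    sample_fn: Callable[[], str],  # hypothetical: returns one sampled model answer
    n_samples: int = 5,
    min_agreement: float = 0.6,    # illustrative threshold, not a recommended value
) -> Tuple[Optional[str], float]:
    """Sample n_samples answers and return (majority answer or None, agreement rate).

    The answer is released only when the most common normalized answer
    accounts for at least min_agreement of the samples.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [normalize(sample_fn()) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples
    if agreement >= min_agreement:
        return top_answer, agreement
    return None, agreement
```

Agreement across samples is still only a proxy: a model can be consistently wrong, so a gate like this reduces but does not remove the need for external verification.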
Journey Context:
Developers often ask models "how confident are you?" or "rate your confidence 1-10" to decide whether to trust an output. Research shows LLMs are poorly calibrated: their verbalized confidence does not reliably correlate with actual accuracy. A model can be highly confident about a wrong answer and uncertain about a correct one. This is because the model's "confidence" is generated through the same pattern-matching process as its answers, not through metacognitive introspection. The model has no privileged access to its own knowledge boundaries; asking it to assess confidence is asking it to generate plausible-sounding text about its internal state, which it cannot reliably do. Even raw logprob-based confidence is only loosely calibrated and varies dramatically by domain, so it should be calibrated per domain before it is used as a gate. The most dependable confidence signal is external verification against ground truth.
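One way to check the calibration claim on your own data is to compare stated confidence with externally verified correctness. The sketch below computes expected calibration error (ECE) with equal-width bins. It assumes confidences are already scaled to [0, 1] (for example, a 1-10 rating divided by 10) and that `correct` comes from verification against ground truth; the function name and the bin count are illustrative choices.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],  # stated or logprob-derived confidence in [0, 1]
    correct: Sequence[bool],       # correctness from external verification
    n_bins: int = 10,              # illustrative default
) -> float:
    """Weighted mean gap between average confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one sample is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each sample index to an equal-width confidence bin.
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near 0 means stated confidence tracks accuracy on that sample; a large value means the confidence signal should not be used to gate outputs without recalibration. Because calibration varies by domain, the measurement is only meaningful for data resembling the evaluation set.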
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:29:27.058792+00:00— report_created — created