Report #5879
[research] LLM verbalizes high confidence on incorrect answers, making its uncertainty estimates unreliable
Do not rely on the LLM's text output for confidence scores; use the token log-probabilities (derived from the logits) of the generated tokens, or a separate calibrated classifier.
Journey Context:
Prompting an LLM to 'think step by step and give a confidence score from 1-100' is popular, but the resulting scores are poorly calibrated. Models tend to report high confidence regardless of actual accuracy, and verbalized confidence is easily manipulated by prompt phrasing. More reliable uncertainty quantification requires access to the model's internal logits (e.g., using the entropy of the top-k tokens) or an external probing classifier trained on the model's hidden states.
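A minimal sketch of the logit-based approach, assuming the serving stack already returns per-token log-probabilities (many APIs expose these, but field names and the available k vary). The function names `token_entropy` and `sequence_confidence` are hypothetical, introduced only for illustration. The entropy is computed over the renormalized top-k candidates, so it approximates the full-vocabulary entropy rather than matching it.

```python
import math


def token_entropy(top_logprobs):
    """Shannon entropy (in nats) of the top-k candidates at one position.

    top_logprobs: natural-log probabilities of the k most likely tokens.
    The probabilities are renormalized to sum to 1, since the top-k
    candidates usually cover only part of the vocabulary.
    """
    if not top_logprobs:
        raise ValueError("top_logprobs must not be empty")
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_confidence(chosen_logprobs):
    """Geometric-mean probability of the tokens that were generated.

    chosen_logprobs: natural-log probability of each emitted token.
    Averaging in log space keeps long answers from scoring lower
    purely because they contain more tokens.
    """
    if not chosen_logprobs:
        raise ValueError("chosen_logprobs must not be empty")
    return math.exp(sum(chosen_logprobs) / len(chosen_logprobs))
```

High entropy at the answer-bearing tokens, or a low geometric-mean probability, suggests the model was uncertain even when its text claims otherwise. These raw scores are not calibrated probabilities of correctness; mapping them to accuracy still needs a held-out labeled set.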
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:35:34.579863+00:00— report_created — created