Report #50659
[research] Asking an LLM 'How confident are you?' or requiring a confidence score in the output yields poorly calibrated, arbitrarily high numbers
Use token log probabilities (if accessible via the API) for calibration. If forced to use verbalized confidence, include few-shot examples of low-confidence outputs and tie the stated confidence strictly to the presence of verbatim grounding text.
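One way to tie verbalized confidence to verbatim grounding is to check the model's quoted evidence against the source text and clamp the score when the quote is missing. The sketch below assumes the model has been prompted to return both a confidence value and an evidence quote. The function name `grounded_confidence`, its parameters, and the `ungrounded_cap` default of 0.2 are hypothetical choices for illustration, not part of any library.

```python
def grounded_confidence(stated_confidence, evidence_quote, source_text,
                        ungrounded_cap=0.2):
    """Return the stated confidence only if the quote appears verbatim in the source.

    Assumption: "verbatim" here means an exact substring match after
    collapsing runs of whitespace. The cap value is an arbitrary example.
    """
    # Clamp the model's number into [0, 1] before using it.
    stated = min(max(float(stated_confidence), 0.0), 1.0)

    # Normalize whitespace so line wrapping does not break the match.
    quote = " ".join(evidence_quote.split())
    source = " ".join(source_text.split())

    # A non-empty quote found in the source counts as grounded.
    if quote and quote in source:
        return stated

    # Otherwise, never report more than the cap.
    return min(stated, ungrounded_cap)


source = "The warranty lasts 24 months from purchase."
grounded = grounded_confidence(0.95, "warranty lasts 24 months", source)
ungrounded = grounded_confidence(0.95, "warranty lasts 36 months", source)
```

Here `grounded` keeps the stated 0.95, while `ungrounded` is clamped to the cap because the quote does not occur in the source. The cap should be tuned against labeled examples rather than taken from this sketch.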
Journey Context:
LLMs lack intrinsic metacognition for numerical confidence. Verbalized confidence is heavily anchored by the prompt's tone and few-shot examples, often defaulting to 90%+. Logprobs of the generated tokens provide a mathematically sounder, though still imperfect, measure of model certainty.
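As a minimal sketch of how token logprobs could be turned into a score, the snippet below assumes the API returns one natural-log probability per generated token. The function name `sequence_confidence` and the sample values are hypothetical. How the logprobs are retrieved depends on the provider and is not shown.

```python
import math


def sequence_confidence(token_logprobs):
    """Return (joint_probability, mean_token_probability).

    Assumption: token_logprobs holds natural-log probabilities,
    one per generated token of the answer span.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")

    total = sum(token_logprobs)

    # Probability of the whole sequence: product of token probabilities.
    joint = math.exp(total)

    # Geometric mean per token: normalizes for answer length.
    mean_token = math.exp(total / len(token_logprobs))

    return joint, mean_token


# Hypothetical logprobs for a three-token answer.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
joint, mean_token = sequence_confidence(logprobs)
```

The joint probability shrinks as the answer gets longer, so the length-normalized value is usually easier to compare across answers. Neither number is calibrated on its own. Both should be checked against labeled outcomes before being used as a threshold.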
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:30:49.188879+00:00— report_created — created