Report #81999
[research] Asking the LLM to verbally report its confidence score results in miscalibrated, overconfident percentages
Use token log probabilities (logprobs) for calibration, or force a strict categorical choice (e.g., high/medium/low) with explicit definitions, rather than asking for a numerical percentage.
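A minimal sketch of the categorical option, assuming the reply is a single label. The level definitions, `build_confidence_prompt`, and `parse_confidence` are hypothetical names and wording chosen for illustration; they do not come from the report or from any library.

```python
# Hypothetical definitions; adjust the wording to the task at hand.
CONFIDENCE_LEVELS = {
    "high": "the answer is stated directly in the provided context",
    "medium": "the answer is inferred from the context but not stated",
    "low": "the context does not support the answer",
}


def build_confidence_prompt(question: str) -> str:
    """Build a prompt that asks for exactly one defined confidence label."""
    definitions = "\n".join(
        f"- {level}: {meaning}" for level, meaning in CONFIDENCE_LEVELS.items()
    )
    return (
        f"{question}\n\n"
        "After answering, rate your confidence using exactly one of these labels:\n"
        f"{definitions}\n"
        "Reply with the label only."
    )


def parse_confidence(reply: str) -> str:
    """Normalize the model's reply and reject anything outside the allowed labels."""
    label = reply.strip().lower().rstrip(".")
    if label not in CONFIDENCE_LEVELS:
        raise ValueError(f"Unexpected confidence label: {reply!r}")
    return label
```

Rejecting unexpected labels, instead of mapping them to a default, keeps a free-form percentage from slipping back in unnoticed.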
Journey Context:
LLMs are notoriously poorly calibrated when asked 'How confident are you from 1 to 100?': they often output high numbers regardless of what they actually know. Verbalized uncertainty is unreliable because the model is predicting the most likely text for a confidence score, not computing a probability. The logprob of the top token, by contrast, is read directly from the model's output distribution, so it provides a mathematically grounded confidence measure.
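The logprob approach can be sketched as follows, assuming the API returns log probabilities for the candidate answer tokens at the answer position. The function name `answer_confidence` and the example values are hypothetical; how the logprobs are obtained depends on the provider and is not shown here.

```python
import math


def answer_confidence(top_logprobs: dict[str, float]) -> tuple[str, float]:
    """Return the most likely candidate token and its probability,
    renormalized over the candidates supplied."""
    # Convert each log probability back to a probability.
    probs = {token: math.exp(logprob) for token, logprob in top_logprobs.items()}
    # Renormalize, since the candidates may not cover the full vocabulary.
    total = sum(probs.values())
    best = max(probs, key=probs.get)
    return best, probs[best] / total


# Hypothetical logprobs for a three-way multiple-choice answer.
example = {"A": math.log(0.6), "B": math.log(0.3), "C": math.log(0.1)}
token, confidence = answer_confidence(example)
```

The resulting value reflects the model's own output distribution; whether it is well calibrated for a given task is still worth checking against labeled data.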
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:14:04.175525+00:00— report_created — created