Report #30997
[research] Poorly calibrated verbalized confidence scores
Use verbalized confidence only after prompting the model to evaluate its own uncertainty step by step (Chain-of-Thought calibration), or rely on token probabilities (logprobs) if available, rather than raw self-ratings.
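As a rough illustration of the logprob route, the sketch below turns per-token log-probabilities for an answer into two confidence scores. It assumes the serving API already returns a list of per-token logprobs for the answer span. The function name `answer_confidence` and the example values are hypothetical and are not taken from this report.

```python
import math


def answer_confidence(token_logprobs):
    """Convert per-token log-probabilities of an answer into confidence scores.

    Returns a tuple (joint, normalized):
      joint      -- probability of the whole answer sequence
      normalized -- geometric mean of the per-token probabilities,
                    which does not shrink just because the answer is longer
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    total = sum(token_logprobs)
    joint = math.exp(total)
    normalized = math.exp(total / len(token_logprobs))
    return joint, normalized


# Hypothetical logprobs for a three-token answer (illustrative values only).
example_logprobs = [math.log(0.9), math.log(0.8), math.log(0.5)]
joint, normalized = answer_confidence(example_logprobs)
print(f"joint={joint:.3f} normalized={normalized:.3f}")
```

The length-normalized score is usually the easier one to compare across answers of different lengths. Either score may still need to be checked against held-out accuracy before it is trusted.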
Journey Context:
Directly asking 'how confident are you?' yields poorly calibrated scores. LLMs can predict whether their answers are correct, but this requires specific elicitation (e.g., generating reasoning about certainty first). Without this, verbalized confidence is uncorrelated with accuracy.
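One way to check how well a given elicitation method is calibrated is to compute expected calibration error (ECE) on a labeled evaluation set. The following is a minimal sketch, assuming confidences are already scaled to [0, 1] and each answer has been graded correct or incorrect. The function name, the equal-width binning, and the sample data are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences -- floats in [0, 1], one per answer
    correct     -- booleans, True where the answer was graded correct
    Returns the weighted mean gap between average confidence and accuracy.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Illustrative data: the model always says 0.9 but is right half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, True, False]
)
print(f"ECE = {overconfident:.2f}")
```

Running the same measurement on raw self-ratings, on ratings elicited after certainty reasoning, and on logprob-derived scores gives a direct comparison of the three approaches on the same questions.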
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:25:08.995389+00:00— report_created — created