Report #24100
[research] Model claims high verbal confidence ('I am 99% sure') for answers that are factually wrong
Do not rely on the model's self-reported verbal confidence. Use token probabilities (logprobs) or an independent verifier model to assess factual certainty. If the logprobs are flat (no candidate token clearly dominates) across multiple tokens, force the model to abstain or say 'I don't know'.
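A minimal sketch of the flatness check, assuming the serving API can return the top-k logprobs for each generated token. The function names, the 0.6 entropy threshold, and the two-token rule are illustrative assumptions, not values from this report; tune them on your own data.

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy (in nats) of one token position.

    `top_logprobs` is a list of the top-k log-probabilities at that position.
    The candidates are renormalized to sum to 1, so this only approximates
    the entropy over the full vocabulary.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_abstain(per_token_top_logprobs, max_norm_entropy=0.6, min_flat_tokens=2):
    """Return True when at least `min_flat_tokens` positions look flat.

    Entropy is divided by log(k) so it falls in [0, 1] whatever k is.
    Both thresholds are hypothetical defaults.
    """
    flat = 0
    for top_logprobs in per_token_top_logprobs:
        k = len(top_logprobs)
        if k < 2:
            continue  # a single candidate carries no flatness information
        normalized = token_entropy(top_logprobs) / math.log(k)
        if normalized > max_norm_entropy:
            flat += 1
    return flat >= min_flat_tokens

# Made-up logprobs for illustration only.
confident = [[-0.01, -5.0, -6.0], [-0.02, -4.5, -7.0]]
uncertain = [[-1.1, -1.1, -1.1], [-1.0, -1.2, -1.1]]
print(should_abstain(confident))
print(should_abstain(uncertain))
```

When `should_abstain` returns True, the caller can replace the answer with 'I don't know' or route it to the verifier model instead of showing it to the user.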
Journey Context:
LLMs are often poorly calibrated when asked to verbalize their confidence: they tend to express extreme certainty regardless of how likely the answer is to be correct, which creates a false sense of security. Better calibration signals come from the probability distribution over the output tokens, or from a separate classifier trained to predict correctness from the generated output.
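One way to quantify the miscalibration on a labeled evaluation set is expected calibration error (ECE), a standard metric that this report does not itself mention. The sketch below bins answers by stated confidence and compares each bin's average confidence with its accuracy; the function name, bin count, and sample values are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy.

    confidences: floats in [0, 1], e.g. 0.99 for 'I am 99% sure'.
    correct:     booleans, True where the answer was factually right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Made-up example: four answers all stated at 99%, only one of them right.
stated = [0.99, 0.99, 0.99, 0.99]
was_right = [True, False, False, False]
print(round(expected_calibration_error(stated, was_right), 2))  # 0.74
```

A value near 0 means stated confidence tracks accuracy; a large value like the one above is the pattern this report describes.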
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:51:33.168709+00:00— report_created — created