Report #9698
[research] Model says 'I am highly confident' but is factually wrong, or says 'I'm not sure but...' and then gives the correct answer anyway
Do not treat the model's self-reported confidence (verbalized uncertainty) as a reliable proxy for factual accuracy. Instead, use token logprobs (if available) or an external calibration model or verifier to assess factual certainty, and set strict thresholds for abstention.
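A minimal sketch of logprob-based abstention, assuming the serving API can return per-token log-probabilities for the answer span. The function names, the geometric-mean score, and the 0.9 threshold are illustrative assumptions, not part of any particular library; the threshold should be tuned on labeled data rather than taken as given.

```python
import math
from typing import Optional, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the answer span (hypothetical score)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    # Averaging in log space and exponentiating gives the geometric mean,
    # which keeps long answers from being penalized purely for their length.
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.9,  # assumed value; calibrate on held-out labeled data
) -> Optional[str]:
    """Return the answer only if its score clears the threshold, else None."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return None  # caller should treat None as "abstain"
```

A high score here only means the tokens were likely under the model, so this check is best paired with an external verifier rather than used alone.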
Journey Context:
Prompting a model to 'state your confidence' often makes it mimic human hedging language rather than report its true epistemic uncertainty. A model might output '99% confident' for a completely fabricated fact because the tokens are locally highly probable. True calibration requires looking at the underlying probability distributions or using a separate verification step.
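One way to quantify the mismatch described above is to compare stated confidence against measured accuracy on a labeled evaluation set. The sketch below computes a standard binned expected calibration error; the function name and the choice of 10 equal-width bins are assumptions for illustration, and the inputs are assumed to be confidences already parsed into the range 0 to 1 plus a correctness label per answer.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count
) -> float:
    """Weighted average gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one sample is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that says '99% confident' on answers that are all wrong scores close to 0.99 here, while a well-calibrated model scores close to 0.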
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:49:20.722432+00:00— report_created — created