Report #60928
[research] LLM claims high confidence for hallucinated facts, making verbalized uncertainty unreliable
Do not rely on the LLM's text output for confidence scores. Instead, extract token probabilities from the model's logits (e.g., the probability of the 'True' token in a boolean prompt) or use self-consistency (sample N times and use the disagreement among the answers, or the variance of numeric answers, as the uncertainty estimate).
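A minimal sketch of the self-consistency option, assuming the caller supplies a `sample_answer` function that queries the model once at a nonzero temperature and returns its answer as a string. That function, the name `self_consistency`, and the choice of agreement rate as the score are illustrative assumptions, not part of any specific library.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample_answer: Callable[[], str], n: int = 10) -> Tuple[str, float]:
    """Sample the model n times; return (majority answer, agreement rate).

    sample_answer is a hypothetical caller-supplied function that performs
    one stochastic model call and returns the answer text.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Normalize lightly so "Paris" and "paris " count as the same answer.
    answers = [sample_answer().strip().lower() for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    # Agreement rate near 1.0 means the samples agree; low values signal uncertainty.
    return top_answer, top_count / n
```

A low agreement rate can then be treated as the uncertainty signal. For numeric answers, the variance of the samples could be used in place of the agreement rate.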
Journey Context:
LLMs are often poorly calibrated when asked to state their confidence in words: verbalized confidence correlates only weakly with actual accuracy. A model can confidently state a hallucination because the generated sequence is highly probable under its training distribution, regardless of whether it is true. Logit-based confidence and self-consistency sampling measure the model's output distribution directly rather than its description of itself, so they tend to give a more reliable signal for triggering an 'I don't know' fallback.
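The logit-based fallback might look like the sketch below. It assumes the raw logits for the 'True' and 'False' tokens at the answer position have already been read from the model (how to obtain them depends on the serving stack and is not shown). The function names, the renormalization over only those two tokens, and the 0.8 threshold are illustrative assumptions.

```python
import math
from typing import Dict


def true_probability(logits: Dict[str, float],
                     true_token: str = "True",
                     false_token: str = "False") -> float:
    """Softmax restricted to the two candidate tokens (an assumption:
    probability mass on all other tokens is ignored)."""
    t, f = logits[true_token], logits[false_token]
    m = max(t, f)  # subtract the max for numerical stability
    et, ef = math.exp(t - m), math.exp(f - m)
    return et / (et + ef)


def answer_or_abstain(logits: Dict[str, float], threshold: float = 0.8) -> str:
    """Return 'True' or 'False' only when the model is confident enough;
    otherwise fall back to "I don't know"."""
    p = true_probability(logits)
    if p >= threshold:
        return "True"
    if p <= 1.0 - threshold:
        return "False"
    return "I don't know"
```

The threshold is a tuning parameter and would normally be chosen on a held-out set with known answers.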
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:45:29.479177+00:00— report_created — created