Report #5251
[research] Can I trust the model's own confidence to detect hallucinations?
No. Raw token probabilities are poorly calibrated for open-ended generation. Instead, sample multiple answers and ask the model to estimate P(True) (the probability that a proposed answer is true) or P(IK) (the probability that the model knows the answer) using a formatted yes/no or multiple-choice probe. Use the aggregate over samples, not a single sample.
Journey Context:
Softmax scores correlate weakly with correctness for two reasons: language is redundant, so probability mass is spread across many paraphrases of the same answer, and models can be confidently wrong. Kadavath et al. ("Language Models (Mostly) Know What They Know", 2022) show that larger models are well calibrated on explicit true/false probes and can self-evaluate, provided the question is posed as a direct probability judgment and the estimate is averaged over diverse samples. Single-sample confidence or top-p thresholds are not enough.
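The sample-then-probe approach described above can be sketched as follows. This is a minimal illustration, not the procedure from the paper: `sample_answer` and `p_true` are hypothetical callables standing in for your own model calls, the prompt wording in `build_probe` is illustrative rather than an exact template, and frequency-weighted averaging is just one possible way to aggregate.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def build_probe(question: str, candidates: List[str], proposed: str) -> str:
    """Format a true/false self-evaluation prompt that also shows the other samples.

    The wording here is an assumption made for illustration.
    """
    brainstormed = "\n".join(f"- {c}" for c in candidates)
    return (
        f"Question: {question}\n"
        f"Here are some brainstormed ideas:\n{brainstormed}\n"
        f"Possible answer: {proposed}\n"
        "Is the possible answer:\n(A) True\n(B) False\n"
        "The possible answer is:"
    )


def self_evaluate(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical: draws one sampled answer
    p_true: Callable[[str], float],       # hypothetical: P("(A) True") for a prompt
    n_samples: int = 5,
) -> Tuple[Dict[str, float], float]:
    """Return per-answer P(True) scores and a frequency-weighted aggregate."""
    candidates = [sample_answer(question) for _ in range(n_samples)]
    counts = Counter(candidates)
    distinct = list(counts)  # distinct answers, in first-seen order

    # Probe each distinct answer once, showing the model all distinct samples.
    scores = {a: p_true(build_probe(question, distinct, a)) for a in distinct}

    # Weight each answer's score by how often it was sampled.
    aggregate = sum(scores[a] * counts[a] for a in distinct) / n_samples
    return scores, aggregate
```

In practice, `p_true` would typically read the model's probability for the "(A)" option relative to "(B)" at the final position of the probe. A low aggregate, or strong disagreement between sampled answers, is a signal to treat the output as a possible hallucination and verify it by other means.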
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:54:40.312779+00:00— report_created — created