Report #88730
[counterintuitive] LLMs state facts with the same high confidence whether they are correct or fabricated, and cannot reliably self-assess their accuracy
Treat all model factual claims as uncalibrated; use retrieval-augmented generation with cited sources for factual questions; use consistency checking across multiple samples to detect uncertainty; never trust the model's self-reported confidence or willingness to answer as a signal of accuracy
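The consistency check mentioned above can be sketched as follows. This is a minimal illustration, not a vetted implementation: `sample_fn` is a hypothetical callable standing in for whatever client samples the model at non-zero temperature, and the `min_agreement` threshold of 0.8 is an assumed value that would need tuning per task.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and trim outer spaces/periods."""
    return " ".join(answer.lower().split()).strip(" .")


def consistency_check(
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled answer
    question: str,
    n_samples: int = 5,
    min_agreement: float = 0.8,       # assumed threshold, tune per task
) -> Tuple[Optional[str], float]:
    """Sample n_samples answers; return (majority answer or None, agreement rate)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers: List[str] = [normalize(sample_fn(question)) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples
    # Low agreement suggests the model is uncertain, so abstain (None).
    # High agreement does NOT prove correctness: a model can be consistently wrong.
    return (top_answer if agreement >= min_agreement else None), agreement
```

Exact string matching after normalization only suits short answers (names, dates, numbers); free-form answers would need some semantic comparison instead. Agreement is a signal for abstaining or escalating to retrieval, not a replacement for grounding.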
Journey Context:
Developers often try to get models to 'only answer when confident' or 'say I don't know if unsure.' This assumes the model has a dependable internal confidence signal that it can report. It does not. LLMs are trained to produce fluent, helpful text, and fluency is orthogonal to accuracy: a model will state a hallucinated fact with the same linguistic confidence as a well-sourced one. Kadavath et al. (2022) showed that while models have some ability to distinguish likely-correct from likely-incorrect answers when specifically prompted, this calibration is far from reliable and degrades on distribution-shifted inputs. The architecture does not maintain epistemic uncertainty separately from token probability, so a high-probability next token can still be factually wrong. The only reliable approach is external grounding: retrieve, cite, verify.
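The "verify" step of retrieve, cite, verify can be sketched as a mechanical check that each cited claim is backed by text that was really retrieved. The `CitedClaim` structure and the `unsupported_claims` helper are hypothetical names introduced for illustration; the sketch assumes the model was prompted to return, for each claim, a source id and a verbatim supporting quote.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CitedClaim:
    text: str       # the claim as stated in the model's answer
    source_id: str  # id of the retrieved passage the model cites
    quote: str      # span the model says it copied from that passage


def _squash(s: str) -> str:
    """Lowercase and collapse whitespace for a tolerant substring match."""
    return " ".join(s.lower().split())


def unsupported_claims(
    claims: List[CitedClaim], passages: Dict[str, str]
) -> List[CitedClaim]:
    """Return claims whose source id is unknown or whose quote is not in the passage."""
    failed: List[CitedClaim] = []
    for claim in claims:
        passage = passages.get(claim.source_id)
        quote = _squash(claim.quote)
        # Reject fabricated ids, empty quotes, and quotes absent from the source.
        if passage is None or not quote or quote not in _squash(passage):
            failed.append(claim)
    return failed
```

This only confirms that the quoted span exists in the cited passage. It does not confirm that the quote supports the claim, so anything it passes still deserves human or entailment-based review when the stakes are high.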
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:31:16.616079+00:00— report_created — created