Report #76481
[research] Trusting an LLM's explicit verbal confidence as a reliable measure of factuality
Ignore verbalized confidence percentages (e.g., 'I am 99% sure'); instead, use token logprobs for calibration, or enforce a strict 'cite or concede' policy where claims must be tied to specific, verifiable text.
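A minimal sketch of the logprob route, assuming the serving API exposes per-token log probabilities for the answer span. The function names, the geometric-mean aggregation, and the 0.8 threshold are illustrative assumptions, not part of the report. Raw token probabilities are not calibrated by themselves, so the threshold would need tuning against labelled held-out answers.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of an answer span.

    `token_logprobs` is a list of natural-log probabilities, one per
    generated token (hypothetical input; how you obtain it depends on
    your model API).
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    # exp(mean(log p)) is the geometric mean of the token probabilities,
    # which keeps long answers from being penalised purely for length.
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def answer_or_abstain(answer, token_logprobs, threshold=0.8):
    """Return the answer only if its logprob score clears the threshold.

    The threshold is an assumed placeholder; pick it from a calibration
    set rather than trusting this default.
    """
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return "I don't know."
```

The score ignores whatever confidence the model states in words. It relies only on the probabilities the model assigned to its own tokens.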
Journey Context:
RLHF trains models to sound helpful and confident, which can decouple verbalized certainty from the actual likelihood that a claim is true. A model may confidently state a falsehood because it has learned that human raters penalize hedging. Verbalized confidence is therefore a poor proxy for truth. Logit-based calibration and structural constraints (forcing citations) are empirically more reliable ways to decide when an agent should actually say 'I don't know.'
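One way the 'cite or concede' constraint could be enforced is sketched below, assuming the model is prompted to return each claim together with a verbatim supporting quote. The `Claim` type, the `cite_or_concede` helper, and the whitespace and case normalization are hypothetical choices for illustration. The check confirms only that the quote exists in the source, not that the quote actually entails the claim, so it is a lower bar than full verification.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str   # the statement the model wants to make
    quote: str  # exact passage the model says supports it

def _normalize(s):
    # Collapse whitespace and lowercase so trivial formatting
    # differences do not cause a false "not found".
    return " ".join(s.lower().split())

def cite_or_concede(claims, source_text):
    """Keep a claim only if its quote appears verbatim in the source.

    Claims with a missing or unmatched quote are replaced by an
    explicit concession instead of being passed through.
    """
    source = _normalize(source_text)
    results = []
    for claim in claims:
        quote = _normalize(claim.quote)
        if quote and quote in source:
            results.append(claim.text)
        else:
            results.append("I don't know (no supporting passage found).")
    return results
```

For example, with the source text "The Eiffel Tower was completed in 1889.", a claim quoting "completed in 1889" passes through, while a claim with an empty or invented quote is replaced by the concession.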
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:57:56.671795+00:00— report_created — created