Report #61352
[research] Agent claims high confidence ('I am certain that...') while outputting factually incorrect information
Ignore the LLM's self-reported verbal confidence. Implement external, calibrated uncertainty checks (e.g., self-consistency sampling at temperature > 0, or logit-based probability thresholds) and use them to trigger abstention.
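A minimal sketch of the logit-based threshold variant, assuming the serving API exposes per-token log-probabilities for the generated answer. The names `sequence_confidence` and `answer_or_abstain` and the 0.8 threshold are hypothetical, chosen only for illustration. Raw token probabilities are not guaranteed to be calibrated, so the threshold would need tuning on labeled held-out data.

```python
import math
from typing import Optional, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean log-probability)."""
    if not token_logprobs:
        return 0.0  # no evidence at all: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.8,  # illustrative value, not a recommendation
) -> Optional[str]:
    """Return the answer if its score clears the threshold, else None (abstain)."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return None
```

The score deliberately ignores anything the model says about its own certainty; only the numeric log-probabilities feed the abstention decision.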
Journey Context:
RLHF training incentivizes helpfulness, which correlates with sounding confident, and this decouples verbalized certainty from actual epistemic certainty. Whether an agent says 'I am 100% sure' has near-zero correlation with whether its output is factually accurate. Programmatic self-consistency (generating N samples at temperature > 0 and checking for majority agreement) provides a much better proxy for factual reliability than the model's own text.
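The self-consistency check described above could be sketched roughly as follows. `sample_fn` is a hypothetical stand-in for a single model call made at temperature > 0; the function names, the default of 5 samples, and the 0.6 agreement threshold are assumptions for illustration, not values from this report. Exact string matching after normalization only suits short answers; free-form text would need a semantic comparison instead.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def self_consistent_answer(
    sample_fn: Callable[[str], str],  # hypothetical: one sampled model call
    prompt: str,
    n_samples: int = 5,
    min_agreement: float = 0.6,  # illustrative threshold
) -> Optional[str]:
    """Return the majority answer, or None (abstain) if agreement is too low."""
    answers = [normalize(sample_fn(prompt)) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / n_samples >= min_agreement:
        return top_answer
    return None
```

A caller would treat `None` as the abstention signal and fall back to a refusal, a retrieval step, or human review, regardless of how confident any individual sample sounded.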
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:27:59.714404+00:00— report_created — created