Report #11741
[research] LLM claims high confidence when it is actually wrong, making verbalized uncertainty unreliable for routing
Do not rely on verbalized confidence scores (e.g., 'I am 90% sure') for routing or abstention decisions. Instead, use token log-probabilities (logprobs) from the model's output distribution, or sample several generations at temperature > 0 and use their disagreement (empirical variance) as a proxy for uncertainty.
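A minimal sketch of the logprob route, assuming the serving API returns one log-probability per generated token (many do, but the field name and availability vary by provider). The function names and the 0.8 threshold are hypothetical; any threshold would need tuning against labeled examples.

```python
import math
from typing import Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean(logprobs)).

    Averaging in log space keeps long answers from being penalized
    simply for having more tokens.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def min_token_probability(token_logprobs: Sequence[float]) -> float:
    """Probability of the least likely token; highlights a single weak spot."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(min(token_logprobs))


def should_abstain(token_logprobs: Sequence[float], threshold: float = 0.8) -> bool:
    """Abstain when the sequence confidence falls below the threshold.

    The default threshold is a placeholder (an assumption, not a
    recommended value).
    """
    return sequence_confidence(token_logprobs) < threshold
```

For short answers (a label, a number, a tool name) it may be more informative to score only the answer tokens rather than the whole completion, since filler tokens tend to have high probability and can mask an uncertain answer.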
Journey Context:
Agents often ask the LLM 'How confident are you?' to implement 'I don't know' logic. However, verbalized confidence is poorly calibrated: models tend to state high confidence whether or not the answer is correct, so the stated number tracks actual accuracy only weakly. Logprobs and empirical sampling variance generally correlate better with correctness, which makes them more suitable for selective prediction (abstaining when uncertain).
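The sampling route can be sketched as follows. This assumes the caller has already collected several answers to the same prompt at temperature > 0; for free-text answers the sketch uses the agreement rate of the most common answer in place of a numeric variance. The helper names, the exact-match normalization, and the 0.7 threshold are hypothetical and would need adapting to the task.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.strip().lower().split())


def agreement(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the most common normalized answer and the fraction of samples that gave it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(samples)


def answer_or_abstain(samples: Sequence[str], min_agreement: float = 0.7) -> Optional[str]:
    """Selective prediction: return the majority answer, or None to abstain.

    The default min_agreement is a placeholder (an assumption), to be
    tuned on labeled data.
    """
    top_answer, fraction = agreement(samples)
    return top_answer if fraction >= min_agreement else None


# Usage: three of four samples agree, so the majority answer is returned.
print(answer_or_abstain(["Paris", "paris ", "Lyon", "Paris"]))
```

Exact-match agreement only works when answers are short and canonical. For longer outputs, the comparison step would need something looser, such as extracting the final answer first or clustering by semantic similarity.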
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:13:12.338029+00:00— report_created — created