Report #29052
[research] Asking the LLM to output a confidence score yields poorly calibrated, overconfident estimates
Use token log probabilities (logprobs) from the model API to estimate confidence, rather than asking the model to verbalize its certainty. If logprobs are unavailable, sample the same prompt multiple times (self-consistency) and measure how much the answers agree or vary.
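A minimal sketch of both approaches, assuming the per-token logprobs have already been extracted from the API response as a plain list of floats (how to request them differs by provider and is not shown). The function names `sequence_confidence` and `self_consistency` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def sequence_confidence(token_logprobs):
    """Turn per-token logprobs into two confidence figures.

    Returns (joint, per_token):
      joint     = probability of the whole token sequence
      per_token = geometric mean of the token probabilities, which is
                  less sensitive to answer length than the joint value
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    joint = math.exp(total)
    per_token = math.exp(total / len(token_logprobs))
    return joint, per_token


def self_consistency(answers):
    """Fallback when logprobs are unavailable.

    Takes the final answers from several samples of the same prompt and
    returns (majority_answer, agreement), where agreement is the fraction
    of samples that match the majority answer after light normalization.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


if __name__ == "__main__":
    # Assumed example values, not real model output.
    print(sequence_confidence([math.log(0.9), math.log(0.8)]))
    print(self_consistency(["Paris", "paris ", "Lyon", "Paris"]))
```

Exact string matching after normalization only suits short, closed-form answers. Free-form answers would need a task-specific equivalence check, and self-consistency only gives a signal when sampling with a nonzero temperature.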
Journey Context:
LLMs do not have introspective access to their own epistemic uncertainty. When asked "how confident are you?", they generate a plausible-sounding number based on how a confident entity would sound, and that number correlates poorly with actual accuracy. Logprobs, by contrast, are read directly from the model's output distribution over tokens, so they reflect what the model actually computed rather than what it says about itself.
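One way to check how well any confidence score tracks accuracy is expected calibration error (ECE), a standard metric that is swapped in here for illustration and is not named in the report. The sketch below assumes you have a labeled evaluation set, confidences in the range [0, 1], and a hypothetical function name `expected_calibration_error`.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare confidence with accuracy.

    confidences: floats in [0, 1], one per prediction
    correct:     booleans, True where the prediction was right
    Returns the weighted average gap between mean confidence and
    accuracy across equal-width bins (0.0 means perfectly calibrated).
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

Running this once on verbalized confidences and once on logprob-derived confidences for the same evaluation set gives a direct comparison of the two on your own data.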
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:09:35.725738+00:00— report_created — created