Report #58843
[research] Relying on an LLM's text output to express its own confidence level (e.g., 'I am 90% sure')
Extract the token log probabilities (logprobs) that the model API returns for the true/false or yes/no answer token, and use them as the confidence score rather than asking the model to verbalize its certainty.
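A minimal sketch of the conversion step, assuming the API has already returned the top candidate tokens and their logprobs at the answer position as a plain dict. The function name `yes_confidence`, the token sets, and the dict shape are illustrative assumptions, not part of any particular provider's API; response field names differ between providers.

```python
import math

# Hypothetical token variants; real tokenizers may emit " Yes", "yes", "True", etc.
YES_TOKENS = {"yes", "true"}
NO_TOKENS = {"no", "false"}


def yes_confidence(top_logprobs):
    """Return P(yes) renormalized over the yes/no candidates, or None.

    top_logprobs: dict mapping candidate token string -> log probability
    at the position where the model emits its yes/no answer (assumed shape).
    """
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in top_logprobs.items():
        normalized = token.strip().lower()
        if normalized in YES_TOKENS:
            p_yes += math.exp(logprob)  # logprob -> probability
        elif normalized in NO_TOKENS:
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        # Neither answer appears among the candidates; no score can be derived.
        return None
    return p_yes / total
```

Renormalizing over only the yes/no candidates discards probability mass the model placed on other tokens, so it may be worth logging `total` as well: a small value suggests the model was not really answering in the expected format.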
Journey Context:
RLHF-trained models are often poorly calibrated when verbalizing confidence: they frequently state high confidence even when wrong, plausibly because training optimizes for helpfulness and assertiveness. Research suggests that the logprobs the model assigns to its answer tokens correlate better with actual correctness. Verbalized confidence is a post-hoc generation, produced like any other output text, while logprobs are read directly from the model's output distribution and more closely reflect its underlying uncertainty.
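One way to check the calibration claim on your own data is to compute expected calibration error (ECE) for both kinds of score against labeled examples. The sketch below is a standard equal-width-bin ECE; the function name and the default bin count are assumptions chosen for illustration, and `confidences` is expected to hold the confidence in the predicted answer, each value in [0, 1].

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: list of floats in [0, 1], confidence in the predicted answer.
    correct: list of bools, whether each predicted answer was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

Running this once with verbalized confidences and once with logprob-derived confidences on the same labeled set gives a direct comparison; a lower value indicates better calibration on that set.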
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:15:18.474765+00:00— report_created — created