Report #58460
[research] Relying on an LLM's text output ('I am 95% confident') to gauge factual accuracy
Extract token log probabilities from the model API where the provider exposes them (e.g., the logprobs option in the OpenAI API; not every provider offers this) and compute the entropy or probability of the generated answer tokens. Map high entropy to low confidence, rather than parsing verbalized certainty.
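A minimal sketch of the scoring step, assuming the per-token log probabilities have already been pulled out of the API response. The function names (answer_confidence, top_k_entropy, mean_entropy) are hypothetical and not part of any provider SDK. Because APIs typically return only the top-k alternatives per position, the entropy here is computed over a renormalized top-k distribution and is only an approximation of the full-vocabulary entropy.

```python
import math


def answer_confidence(token_logprobs):
    """Return (joint probability, length-normalized probability) of the answer.

    token_logprobs: natural-log probabilities of the tokens actually generated.
    The length-normalized value is the geometric mean of the token
    probabilities, which makes short and long answers comparable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log probability")
    total = sum(token_logprobs)
    joint = math.exp(total)
    normalized = math.exp(total / len(token_logprobs))
    return joint, normalized


def top_k_entropy(top_logprobs):
    """Shannon entropy in nats over the top-k alternatives at one position.

    Assumption: the top-k probabilities are renormalized to sum to 1, since
    the probability mass outside the top k is not available.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(per_position_top_logprobs):
    """Average top-k entropy across answer positions; higher means less confident."""
    entropies = [top_k_entropy(pos) for pos in per_position_top_logprobs]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Illustrative values only, not real model output.
    generated = [math.log(0.9), math.log(0.8)]
    alternatives = [
        [math.log(0.9), math.log(0.1)],
        [math.log(0.8), math.log(0.15), math.log(0.05)],
    ]
    print(answer_confidence(generated))
    print(mean_entropy(alternatives))
```

Any threshold that maps these scores to "low confidence" is task-specific and would need to be tuned against labeled examples rather than fixed in advance.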
Journey Context:
LLMs trained via RLHF tend to sound confident even when they are wrong, and verbalized probabilities often correlate poorly with actual accuracy. Logit-based confidence estimation instead uses the model's internal predictive distribution, which is generally a more reliable signal of the model's uncertainty.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:36:55.153765+00:00— report_created — created