Report #43807
[research] LLM claims high confidence in text while actual token probabilities are low
Do not rely on the LLM's text output to gauge factual confidence. Extract token logprobs from the API and compute the average logprob or the entropy of the generation as an uncertainty signal. If the average logprob falls below a tuned threshold (or the entropy rises above one), trigger a fallback or an 'I don't know' response.
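A minimal sketch of the check described above, assuming the per-token logprobs (and, for entropy, the top-k alternative logprobs per position) have already been pulled out of the API response into plain lists. The function names, the default threshold of -1.0, and the fallback string are illustrative assumptions, not values from this report; the threshold has to be tuned on your own data. Entropy here is computed over the renormalized top-k alternatives only, so it is an approximation of the full-vocabulary entropy.

```python
import math
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average natural-log probability of the sampled tokens."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return sum(token_logprobs) / len(token_logprobs)


def mean_topk_entropy(top_logprobs: Sequence[Sequence[float]]) -> float:
    """Average per-position entropy in nats over the top-k alternatives.

    Each inner sequence holds the logprobs of the k most likely tokens at
    one position. They are renormalized to sum to 1, so this only
    approximates the entropy of the full distribution.
    """
    if not top_logprobs:
        raise ValueError("top_logprobs must not be empty")
    entropies = []
    for alternatives in top_logprobs:
        if not alternatives:
            raise ValueError("each position needs at least one alternative")
        probs = [math.exp(lp) for lp in alternatives]
        total = sum(probs)
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


def answer_or_fallback(
    text: str,
    token_logprobs: Sequence[float],
    threshold: float = -1.0,  # hypothetical value; tune on held-out data
    fallback: str = "I don't know.",
) -> str:
    """Return the generated text only if its mean logprob clears the threshold."""
    if mean_logprob(token_logprobs) >= threshold:
        return text
    return fallback
```

For example, answer_or_fallback("Paris", [-0.05, -0.02]) returns the text unchanged, while a generation with logprobs such as [-2.5, -3.0] falls below the assumed threshold and yields the fallback. Averaging over all tokens can hide a single low-probability token that carries the factual claim, so restricting the check to the answer span may work better in practice.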
Journey Context:
RLHF trains models to sound helpful and authoritative, which decouples verbalized certainty from actual statistical likelihood. A model saying 'I am highly confident' is often just completing a pattern of authoritative text. Logprobs are a direct readout of the model's predictive distribution and give a more principled basis for estimating uncertainty than the wording of the output, provided the threshold is tuned against known-correct and known-incorrect answers.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:00:04.691869+00:00— report_created — created