Report #31514
[gotcha] Using LLM logprobs as confidence scores exposed to users — scores are systematically miscalibrated
Never expose raw logprobs as 'confidence' or 'accuracy' percentages to users. If you must show confidence, apply calibration (e.g. Platt scaling or temperature scaling) against a held-out validation set. Better: use self-consistency (multiple samples with majority vote) as a proxy for confidence rather than logprobs. Treat logprobs as a ranking signal, not a probability.
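A minimal sketch of the temperature-scaling option mentioned above, assuming you already have a held-out validation set of (raw confidence, was the answer correct) pairs. The data, the function names (`fit_temperature`, `scaled`) and the grid-search range are illustrative assumptions, not part of any library; a real setup would likely use an optimizer and far more validation examples.

```python
import math

def logit(p, eps=1e-6):
    # Clamp to avoid log(0) when a raw confidence is exactly 0 or 1.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def scaled(p, temperature):
    # Temperature scaling on a binary confidence: divide the logit by T.
    return 1 / (1 + math.exp(-logit(p) / temperature))

def nll(confidences, correct, temperature):
    # Mean negative log-likelihood of the correctness labels.
    total = 0.0
    for p, y in zip(confidences, correct):
        q = min(max(scaled(p, temperature), 1e-12), 1 - 1e-12)
        total -= math.log(q) if y else math.log(1 - q)
    return total / len(confidences)

def fit_temperature(confidences, correct):
    # Coarse grid search over T in 0.1 .. 10.0 (an assumed range).
    grid = [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: nll(confidences, correct, t))

# Hypothetical validation set: logprobs implying 90-98% confidence,
# but only 6 of the 10 answers were actually correct.
logprobs = [-0.02, -0.03, -0.05, -0.05, -0.08, -0.10, -0.02, -0.04, -0.06, -0.10]
was_correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

raw = [math.exp(lp) for lp in logprobs]
T = fit_temperature(raw, was_correct)
calibrated = scaled(raw[0], T)
print(f"T={T:.1f} raw={raw[0]:.2f} calibrated={calibrated:.2f}")
```

A fitted T above 1 means the raw scores were overconfident and get pulled toward 50%. The fit only holds for inputs resembling the validation set, so it should be re-checked whenever the model or prompt changes.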
Journey Context:
LLM logprobs are often poorly calibrated as a measure of answer correctness: a token whose logprob implies 95% probability may belong to an answer that is right far less often than that. A logprob is the model's probability for the next token given the preceding text, which is not the probability that the overall answer is factually correct. Cross-entropy training optimizes next-token likelihood rather than answer-level accuracy, modern neural networks are known to produce overconfident softmax distributions, and post-training such as RLHF has been reported to degrade calibration further. The counter-intuitive part: a model can be very 'confident' (high logprobs) about a completely wrong answer, especially for questions outside its training distribution or in domains where it has seen few examples. This is well studied in the neural network calibration literature. Exposing these values as confidence scores misleads users into over-trusting wrong answers. Self-consistency (generating multiple responses and checking agreement) is a better proxy but costs more compute. The practical takeaway: if your UI shows any form of confidence indicator, calibrate it or use self-consistency, and never show raw logprobs.
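The self-consistency idea can be sketched as follows, under stated assumptions: `sample_answer` is a hypothetical callable that returns one model answer per call (in practice an API call with temperature above 0), and `normalize` is a deliberately naive stand-in for whatever answer matching your task needs. A canned iterator replaces the model here so the example runs on its own.

```python
from collections import Counter
from typing import Callable, Tuple

def normalize(answer: str) -> str:
    # Naive normalization (assumption): lowercase and collapse whitespace.
    return " ".join(answer.strip().lower().split())

def self_consistency(sample_answer: Callable[[], str],
                     n_samples: int = 10) -> Tuple[str, float]:
    # Draw n samples, take the majority answer, and report the share of
    # samples that agree with it.
    votes = Counter(normalize(sample_answer()) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

# Stand-in for a real model call: five canned responses.
canned = iter(["Paris", "paris ", "Lyon", "Paris", "Paris"])
answer, agreement = self_consistency(lambda: next(canned), n_samples=5)
print(answer, agreement)
```

The agreement share is itself only a proxy: a model can agree with itself on a wrong answer, so if it is shown to users as a percentage it would still need checking against held-out data.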
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:16:54.214445+00:00— report_created — created