Report #11152
[research] LLM answers obscure questions with high confidence instead of expressing calibrated uncertainty or refusing
Use token probabilities (logprobs) of the first few tokens to calculate a confidence score. If the probability of the chosen answer falls below a tuned threshold, trigger a refusal pathway ('I don't know').
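A minimal sketch of this check, assuming the serving API already returns per-token log probabilities for the generated answer. The report does not say how the first few tokens should be combined, so the geometric mean used here is an assumption, and the names `answer_confidence`, `answer_or_refuse`, `first_k`, and the default threshold of 0.6 are hypothetical placeholders.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float], first_k: int = 3) -> float:
    """Geometric-mean probability of the first `first_k` generated tokens.

    Assumption: averaging the logprobs (a geometric mean of probabilities)
    is one reasonable way to combine them; the report does not specify one.
    """
    head = list(token_logprobs[:first_k])
    if not head:
        # No tokens to score: treat as zero confidence so the caller refuses.
        return 0.0
    return math.exp(sum(head) / len(head))


def answer_or_refuse(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.6,  # hypothetical default; must be tuned per model and task
    first_k: int = 3,
) -> str:
    """Return the answer, or the refusal string if confidence is below threshold."""
    if answer_confidence(token_logprobs, first_k) < threshold:
        return "I don't know"
    return answer
```

For example, `answer_or_refuse("Paris", [-0.05, -0.02, -0.01])` keeps the answer, while `answer_or_refuse("Zog IV", [-2.0, -1.5, -1.0])` takes the refusal pathway.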
Journey Context:
Prompting 'say I don't know if you aren't sure' is unreliable because the model's verbalized certainty correlates poorly with its internal confidence. Token logprobs give a more direct signal of the model's uncertainty than its wording does. The tradeoff is tuning the threshold: set too high, it causes false refusals of answers that would have been correct; set too low, it lets hallucinations through.
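The threshold tradeoff can be made concrete with a small sweep over a labeled validation set. This is a sketch under the assumption that each validation question has a confidence score and a correct/incorrect label; `sweep_thresholds` and the sample values are hypothetical and not taken from the report.

```python
from typing import Iterable, List, Sequence, Tuple


def sweep_thresholds(
    samples: Sequence[Tuple[float, bool]],
    thresholds: Iterable[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, false_refusal_rate, hallucination_rate).

    samples: (confidence, is_correct) pairs from a labeled validation set.
    false_refusal_rate: share of correct answers that would be refused.
    hallucination_rate: share of wrong answers that would be let through.
    """
    correct = [conf for conf, ok in samples if ok]
    wrong = [conf for conf, ok in samples if not ok]
    rows = []
    for t in thresholds:
        false_refusal = sum(c < t for c in correct) / len(correct) if correct else 0.0
        hallucination = sum(c >= t for c in wrong) / len(wrong) if wrong else 0.0
        rows.append((t, false_refusal, hallucination))
    return rows


# Hypothetical validation data, for illustration only.
validation = [(0.9, True), (0.8, True), (0.4, True),
              (0.7, False), (0.3, False), (0.2, False)]
for t, fr, hal in sweep_thresholds(validation, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f} false_refusals={fr:.2f} hallucinations={hal:.2f}")
```

Raising the threshold moves errors from the hallucination column to the false-refusal column, so the operating point depends on which error is more costly for the application.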
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:41:15.471579+00:00— report_created — created