Report #4913
[research] LLM is overconfident and fails to say 'I don't know' on obscure queries
Implement a calibrated confidence threshold using token probabilities (logprobs). If the top-1 probability is below a tuned threshold, or if the entropy of the distribution is too high, programmatically override the generation to return a refusal or trigger a retrieval step.
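The gate described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the input shape (one list of top-k logprobs per generated token), the function names, and the default thresholds `min_top1_prob=0.5` and `max_entropy=1.0` are all assumptions made for the example. Entropy is computed over the returned top-k candidates only, which approximates the entropy of the full distribution.

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy in nats over the top-k candidates, renormalized to sum to 1.

    Only an approximation of the full-vocabulary entropy, since the tail is missing.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_override(steps, min_top1_prob=0.5, max_entropy=1.0):
    """Return True if any token position looks uncertain.

    steps: one list of top-k logprobs per generated token (hypothetical shape;
    adapt to whatever your API or runtime returns).
    Thresholds are placeholder values and must be tuned per model.
    """
    for top_logprobs in steps:
        top1_prob = math.exp(max(top_logprobs))
        if top1_prob < min_top1_prob or token_entropy(top_logprobs) > max_entropy:
            return True
    return False

def answer_or_fallback(text, steps, fallback="I don't know."):
    """Return the generated text, or the fallback when the gate fires.

    A retrieval step could be triggered here instead of returning a refusal.
    """
    return fallback if should_override(steps) else text
```

Checking every token position is the strictest variant; restricting the check to the tokens that carry the factual claim is likely to cause fewer false refusals.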
Journey Context:
LLMs are often poorly calibrated: they can sound confident even when wrong. Prompting with 'say I don't know if you aren't sure' often causes over-refusal on easy questions and still fails on hard ones, so relying on the model to self-assess is unreliable. Instead, extract the logprobs of the generated tokens. A low probability on key factual tokens (names, dates, numbers) is a quantitative signal of uncertainty. The tradeoff is that this requires an API or runtime that exposes token logprobs and a threshold tuned per model, but it tends to be more robust than prompt-based refusal.
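One way to tune the per-model threshold is a simple sweep over a labeled validation set. The sketch below assumes each sample is a `(confidence, was_correct)` pair, where `confidence` might be the lowest token probability among the key factual tokens of an answer. The function name, the sample format, and the scoring rule (count of correct answer/abstain decisions) are illustrative assumptions; a real deployment may want to weight wrong answers and unnecessary refusals differently.

```python
def tune_threshold(samples):
    """Pick the confidence threshold that maximizes correct decisions.

    samples: list of (confidence, was_correct) pairs from a validation set.
    A decision is counted as correct when the model answers (confidence >= t)
    and was right, or abstains (confidence < t) and would have been wrong.
    """
    # Candidate thresholds: every observed confidence, plus "always abstain".
    candidates = sorted({conf for conf, _ in samples}) + [float("inf")]
    best_t, best_score = candidates[0], -1
    for t in candidates:
        score = sum((conf >= t) == ok for conf, ok in samples)
        if score > best_score:  # strict '>' keeps the lowest threshold on ties
            best_t, best_score = t, score
    return best_t

# Toy validation data (made up for illustration only).
validation = [(0.95, True), (0.90, True), (0.85, True), (0.60, False), (0.40, False)]
threshold = tune_threshold(validation)
```

On this toy data the sweep picks 0.85, the lowest confidence at which every answered sample was correct. The threshold should be re-tuned whenever the model or the type of query changes.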
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:17:46.034834+00:00— report_created — created