Report #58111
[research] LLM either refuses to answer common knowledge questions or hallucinates answers to obscure questions instead of abstaining
Implement selective prediction using the model's token probabilities: have the model abstain when the top-1 token probability of its answer falls below a threshold calibrated on held-out data, rather than relying on prompt-based 'say I don't know' instructions.
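A minimal sketch of the abstention rule, assuming the serving stack already returns one log-probability per generated answer token. The function names and the choice of the weakest token as the confidence score are illustrative assumptions, not part of the report; a mean or first-token score could be swapped in.

```python
import math
from typing import Optional, Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Score an answer by its least likely token (assumed aggregation).

    token_logprobs holds the natural-log probability of each chosen
    answer token. An empty answer gets confidence 0.0.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(min(token_logprobs))


def answer_or_abstain(
    answer: str, token_logprobs: Sequence[float], threshold: float
) -> Optional[str]:
    """Return the answer, or None (abstain) if confidence is below threshold."""
    if answer_confidence(token_logprobs) < threshold:
        return None
    return answer
```

For example, `answer_or_abstain("Paris", [-0.01, -0.02], 0.8)` keeps the answer, while an answer containing a token with log-probability -1.6 (probability about 0.2) is dropped at the same threshold.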
Journey Context:
Prompting a model to say 'I don't know' often leads to over-refusal (lowering recall on facts the model actually knows) because the model struggles to express its own uncertainty in text. A more reliable signal is the logprobs of the output tokens. If the probability distribution over the answer tokens is flat, the model is uncertain: forcing it to answer yields hallucinations, while a blanket 'I don't know' instruction kills recall. A logprob threshold lets you tune the precision-recall tradeoff directly.
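One way the threshold could be calibrated, sketched under the assumption that a held-out set of (confidence, was_correct) pairs is available. The function name, the data layout, and the "lowest threshold that still meets a target precision" criterion are hypothetical choices for illustration.

```python
import math
from typing import List, Tuple


def calibrate_threshold(
    examples: List[Tuple[float, bool]], target_precision: float
) -> float:
    """Pick the lowest threshold whose answered set meets target_precision.

    An example counts as answered when confidence >= threshold. Lower
    thresholds answer more questions, so this maximizes coverage subject
    to the precision target. Returns math.inf (always abstain) if no
    threshold meets the target.
    """
    ranked = sorted(examples, key=lambda pair: pair[0], reverse=True)
    best = math.inf
    correct = 0
    for i, (confidence, was_correct) in enumerate(ranked):
        correct += int(was_correct)
        # Evaluate only at the end of a run of equal confidences, since a
        # threshold cannot separate examples that share the same score.
        group_ends = i + 1 == len(ranked) or ranked[i + 1][0] != confidence
        if group_ends and correct / (i + 1) >= target_precision:
            best = confidence
    return best
```

Raising `target_precision` moves the threshold up and trades recall for fewer hallucinated answers; lowering it does the reverse.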
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:01:49.224342+00:00— report_created — created