Report #55020
[research] When instructed to be cautious, the model says 'I don't know' or refuses factual questions that it answers with high accuracy otherwise
Calibrate the 'I don't know' threshold on a validation set. Instead of broad prompt instructions like 'refuse if unsure', use targeted selective prediction: generate the answer, score it by self-consistency or token log-probability, and output it only if the score passes the threshold; otherwise, output a refusal.
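The gating step can be sketched as follows, using self-consistency as the confidence score. This is a minimal illustration, not a reference implementation: `sample_fn` is a hypothetical callable standing in for whatever function draws one sampled answer from the model, and the exact-match normalization (strip and lowercase) is an assumption that only suits short factual answers.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def self_consistency(samples: List[str]) -> Tuple[Optional[str], float]:
    """Return the majority answer and the fraction of samples that agree with it.

    Answers are compared after stripping whitespace and lowercasing
    (an assumption; free-form answers would need a looser match).
    """
    first_seen = {}  # normalized answer -> first original spelling
    keys = []
    for s in samples:
        key = s.strip().lower()
        if key:
            keys.append(key)
            first_seen.setdefault(key, s.strip())
    if not keys:
        return None, 0.0
    top_key, count = Counter(keys).most_common(1)[0]
    # Empty samples count as disagreement, so divide by all samples.
    return first_seen[top_key], count / len(samples)


def selective_answer(
    question: str,
    sample_fn: Callable[[str], str],  # hypothetical: returns one sampled answer
    threshold: float,
    n_samples: int = 5,
    refusal: str = "I don't know.",
) -> str:
    """Generate first, then answer only if the agreement score passes the threshold."""
    samples = [sample_fn(question) for _ in range(n_samples)]
    answer, score = self_consistency(samples)
    if answer is None or score < threshold:
        return refusal
    return answer
```

Note that the prompt passed to `sample_fn` carries no caution instruction; the refusal decision is made entirely after generation. A log-probability score could replace `self_consistency` if the model API exposes one.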
Journey Context:
Naively prompting an LLM to 'avoid hallucinations' or 'only answer if certain' drastically reduces recall (helpfulness) without proportionally increasing precision (factuality). The model over-complies with the 'caution' instruction and refuses questions it would otherwise answer correctly. Selective prediction (rejecting low-confidence outputs after generation) achieves a much better precision-recall tradeoff than prompt-induced refusal.
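One way to pick the threshold from a validation set is sketched below, assuming each validation example has already been scored and graded. The function name `calibrate_threshold`, the `(confidence, is_correct)` pair format, and the `target_precision` parameter are illustrative assumptions, not part of the original report. The sketch returns the lowest threshold whose answered subset still meets the precision target, which maximizes recall at that precision.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(
    scored: List[Tuple[float, bool]],  # (confidence, is_correct) per validation example
    target_precision: float,
) -> Optional[float]:
    """Return the lowest confidence threshold meeting the precision target.

    Answering every example with confidence >= threshold must give
    precision >= target_precision. Returns None if no threshold qualifies.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Evaluate only at boundaries between distinct confidence values,
        # since a threshold admits all examples sharing that value.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if correct / i >= target_precision:
            best = conf  # keeps being lowered while the target still holds
    return best


validation = [(0.9, True), (0.8, True), (0.6, False), (0.6, True), (0.3, False)]
threshold = calibrate_threshold(validation, target_precision=0.75)
print(threshold)  # 0.6: answering at >= 0.6 gives 3 correct out of 4
```

The threshold should be fit on held-out data that resembles production traffic, since a threshold tuned on one question distribution may not transfer to another.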
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:50:46.905513+00:00— report_created — created