Report #11337
[research] LLM answers a question it lacks knowledge for, rather than abstaining or stating it doesn't know, leading to confident hallucinations
Implement selective question answering: derive a confidence score from the model's internal signals (e.g., token probabilities or logits), calibrate it, and abstain when the score falls below a threshold. Alternatively, fine-tune or prompt the model to output 'Unanswerable' for out-of-distribution queries.
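A minimal sketch of the thresholding step, assuming the serving stack can return per-token log-probabilities for the generated answer. The function names, the length-normalized confidence score, and the default threshold of 0.75 are illustrative assumptions, not part of the report:

```python
import math
from typing import Sequence

ABSTAIN = "Unanswerable"  # abstention string named in the report


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities (length-normalized).

    Assumption: this is one simple confidence proxy; others (minimum token
    probability, a trained calibrator) may work better in practice.
    """
    if not token_logprobs:
        return 0.0  # no evidence at all: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def selective_answer(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.75,  # hypothetical default; tune on held-out data
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return ABSTAIN
```

The score is normalized by length so that longer answers are not penalized merely for containing more tokens. Raw token probabilities are often miscalibrated, so the threshold should be chosen on labeled data rather than fixed by hand.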
Journey Context:
Standard RLHF tends to push models toward always providing an answer, because refusals are typically penalized. For high-stakes factual questions, however, an incorrect answer is worse than no answer. Thresholding calibrated token probabilities is more reliable than prompting 'tell me if you don't know': models will often claim to know the answer and then hallucinate the details.
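One way to set the threshold is to calibrate it on a held-out set of questions with known answers. The sketch below assumes each held-out example has already been reduced to a (confidence, was_correct) pair; `pick_threshold` and `target_accuracy` are hypothetical names introduced for illustration:

```python
from typing import List, Optional, Tuple


def pick_threshold(
    scored: List[Tuple[float, bool]],
    target_accuracy: float,
) -> Optional[float]:
    """Lowest threshold whose answered subset meets the target accuracy.

    `scored` holds (confidence, was_correct) pairs from held-out data.
    An example counts as answered when its confidence >= threshold.
    Returns None if no threshold reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for count, (conf, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Examples with equal confidence are answered together, so only
        # evaluate a cut after the last example of each tied group.
        if count < len(ranked) and ranked[count][0] == conf:
            continue
        if correct / count >= target_accuracy:
            best = conf  # a lower cut answers more questions
    return best
```

Choosing the lowest qualifying threshold maximizes coverage (the fraction of questions answered) subject to the accuracy target. A `None` result means the confidence score cannot separate correct from incorrect answers well enough on this data.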
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:09:20.849898+00:00 - report_created - created