Report #24112
[research] Model attempts to answer every question instead of abstaining, leading to hallucinations on out-of-distribution or unknown topics
Implement an explicit abstention mechanism. Prompt the model to output a specific token (e.g., 'UNANSWERABLE') if the retrieved context lacks the answer, or if the model's internal confidence score falls below a predefined threshold.
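A minimal sketch of the mechanism described above, in Python. Everything here is an assumption made for illustration: the helper names (build_prompt, sequence_confidence, answer_or_abstain), the prompt wording, the 0.5 threshold, and the use of the geometric mean of token probabilities as a stand-in for the "internal confidence score". It also assumes the serving stack exposes per-token log-probabilities for the generated answer.

```python
import math
from typing import Sequence

# Sentinel the model is asked to emit when it cannot answer (from the report).
ABSTAIN_TOKEN = "UNANSWERABLE"


def build_prompt(question: str, context: str) -> str:
    """Hypothetical prompt template that instructs the model to abstain."""
    return (
        "Answer the question using only the context below.\n"
        f"If the context does not contain the answer, reply with exactly {ABSTAIN_TOKEN}.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (an assumed confidence proxy)."""
    if not token_logprobs:
        return 0.0  # no evidence at all: treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.5,  # assumed value; tune on held-out data
) -> str:
    """Return the answer, or ABSTAIN_TOKEN if the model abstained or is unsure."""
    cleaned = answer.strip()
    if cleaned == ABSTAIN_TOKEN:
        return ABSTAIN_TOKEN
    if sequence_confidence(token_logprobs) < threshold:
        return ABSTAIN_TOKEN
    return cleaned
```

The two checks are kept separate on purpose: the prompt-level check catches cases where the context lacks the answer, while the threshold catches fluent answers the model assigned low probability to. Token-level probability is only a rough proxy for correctness, so the threshold should be calibrated rather than taken from this sketch.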
Journey Context:
Standard RLHF tends to penalize 'I don't know' responses because they are rated as less helpful, which trains models to always attempt an answer. This causes severe hallucinations on niche or novel topics. Teaching a model to abstain (selective prediction) can substantially reduce the error rate on the questions it does answer, at the cost of somewhat lower coverage (the fraction of questions answered). The tradeoff is worth it for factuality.
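The coverage versus error tradeoff can be measured directly on a labeled evaluation set. The sketch below is illustrative only: the function name risk_coverage is hypothetical, and the confidence scores and correctness labels are made-up placeholders, not results from any model.

```python
from typing import Sequence, Tuple


def risk_coverage(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, risk) when answering only if confidence >= threshold.

    coverage: fraction of all questions that get an answer.
    risk: error rate among the answered questions (0.0 if none are answered).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0, 0.0
    answered = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(confidences)
    errors = sum(1 for ok in answered if not ok)
    risk = errors / len(answered) if answered else 0.0
    return coverage, risk


# Made-up example data, for illustration only.
example_conf = [0.9, 0.8, 0.6, 0.4, 0.3]
example_ok = [True, True, False, True, False]

for t in (0.0, 0.5, 0.7):
    cov, risk = risk_coverage(example_conf, example_ok, t)
    print(f"threshold={t:.1f} coverage={cov:.2f} risk={risk:.2f}")
```

Sweeping the threshold over a held-out set like this gives a risk-coverage curve, which is one way to pick the operating point where the drop in coverage is acceptable for the gain in factuality.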
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:52:37.583656+00:00— report_created — created