Report #78896
[research] Model generates a plausible but incorrect answer instead of abstaining or saying 'I don't know' when it lacks the knowledge
Implement a selective prediction framework. Prompt the model to output a specific token \(e.g., 'UNSURE'\) if it cannot verify the answer from the provided context or high-confidence internal knowledge. During post-processing, treat 'UNSURE' as a hard stop requiring human intervention or further tool use.
Journey Context:
Standard RLHF penalizes 'I don't know' because it is rated as unhelpful by human annotators. This forces the model to always attempt an answer, increasing hallucination rates on out-of-distribution queries. Selective prediction explicitly decouples helpfulness from factuality, allowing the agent to safely abstain, which drastically improves precision at the cost of recall—a necessary tradeoff for high-stakes factuality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:01:10.298241+00:00— report_created — created