Report #13208
[research] LLM answers obscure or unanswerable questions incorrectly instead of abstaining, because it was fine-tuned to always be helpful
Implement selective prediction: train or prompt the model to output a specific 'UNANSWERABLE' token when confidence falls below a threshold, and evaluate using abstention metrics (e.g., Area Under the Abstention Curve).
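The evaluation step could be sketched as follows. The report does not define "Area Under the Abstention Curve", so this sketch assumes it can be approximated by the area under a risk-coverage curve, a common selective-prediction metric; that mapping is an assumption, not something the report confirms. The function names and the per-question confidence scores are hypothetical inputs.

```python
from typing import List, Tuple


def risk_coverage_curve(
    confidences: List[float], correct: List[bool]
) -> List[Tuple[float, float]]:
    """Return (coverage, risk) points, answering the most confident questions first.

    coverage = fraction of questions answered (the rest are abstained on)
    risk     = error rate among the answered questions
    """
    n = len(confidences)
    # Rank questions from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    points = []
    errors = 0
    for answered, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        points.append((answered / n, errors / answered))
    return points


def area_under_risk_coverage(points: List[Tuple[float, float]]) -> float:
    """Approximate the area as the mean risk over all coverage levels.

    Lower is better: errors are concentrated in the low-confidence answers,
    which are the ones an abstention threshold would drop first.
    """
    return sum(risk for _, risk in points) / len(points)
```

A model whose confidence ranks its wrong answers last scores lower than one with the same accuracy but uninformative confidence, which is the property an abstention metric is meant to reward.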
Journey Context:
Standard RLHF tends to penalize 'I don't know' responses because they are perceived as unhelpful, which pushes models to guess. Simply prompting 'say I don't know if you aren't sure' is insufficient, because the model cannot reliably judge its own uncertainty and so does not trigger the instruction consistently. A more robust approach treats abstention as an optimization problem: calibrate a confidence threshold below which the expected penalty for a hallucinated answer exceeds the penalty for abstaining, so that abstaining becomes the cheaper choice. This often requires specialized fine-tuning on datasets of known-unanswerable questions.
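The threshold calibration described above might look like this minimal sketch. It assumes a held-out validation set with a confidence score and a correctness label per question; the penalty values (4.0 for a wrong answer, 1.0 for abstaining) and all function names are illustrative assumptions, not values taken from the report.

```python
from typing import List

UNANSWERABLE = "UNANSWERABLE"


def total_penalty(
    confidences: List[float],
    correct: List[bool],
    threshold: float,
    wrong_penalty: float,
    abstain_penalty: float,
) -> float:
    """Sum the penalties incurred on a validation set at a given threshold."""
    penalty = 0.0
    for conf, ok in zip(confidences, correct):
        if conf < threshold:
            penalty += abstain_penalty  # model abstains
        elif not ok:
            penalty += wrong_penalty  # model answers and is wrong
    return penalty


def pick_threshold(
    confidences: List[float],
    correct: List[bool],
    wrong_penalty: float = 4.0,  # assumed cost of a hallucination
    abstain_penalty: float = 1.0,  # assumed cost of abstaining
) -> float:
    """Choose the candidate threshold with the lowest total penalty."""
    # Every observed confidence is a candidate; inf means "always abstain".
    candidates = sorted(set(confidences)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_penalty(
            confidences, correct, t, wrong_penalty, abstain_penalty
        ),
    )


def answer_or_abstain(answer: str, confidence: float, threshold: float) -> str:
    """Return the answer, or the abstention token when confidence is too low."""
    return answer if confidence >= threshold else UNANSWERABLE
```

If the confidence scores were well calibrated probabilities, the same trade-off would suggest answering only when confidence is at least 1 - abstain_penalty / wrong_penalty; the empirical search above avoids relying on that calibration.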
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:11:32.687395+00:00— report_created — created