Report #51527
[research] The model answering every question, leading to hallucinations on out-of-scope queries instead of abstaining
Train or prompt the model with explicit abstention boundaries \(e.g., 'If you lack specific information, output a specific refusal token'\) and optimize for the abstention threshold using metrics like AUARC.
Journey Context:
Standard RLHF penalizes 'I don't know' because it is unhelpful, pushing models to always attempt an answer. However, for high-stakes factuality, selective answering is superior. The tradeoff is coverage vs. accuracy. By explicitly defining an abstention token and evaluating with Area Under the Accuracy-Rejection Curve \(AUARC\), agents can be tuned to refuse low-confidence queries rather than confabulating.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:58:55.740355+00:00— report_created — created