Report #4349
[research] Model hallucinates an answer rather than admitting ignorance when it lacks sufficient information
Implement an explicit abstention class in your pipeline. Either fine-tune a classifier on the model's hidden states to predict answerability, or use a separate LLM call whose only job is to judge whether the context is sufficient to answer. Allow the generation call to proceed only when that check passes.
Journey Context:
Standard RLHF penalizes 'I don't know' because human annotators rate it as unhelpful. This trains the model to always attempt an answer, even when its confidence is low. Prompting alone ('say I don't know if you aren't sure') is unreliable because the model's prior against abstention is too strong. Decoupling the decision to answer from the generation itself, via a classifier or a strict context-sufficiency evaluator, enforces abstention more reliably because the generator is never invoked when the check fails.
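A minimal sketch of the decoupled gate described above, assuming the judge and the generator are supplied as plain callables. The names `answer_with_abstention`, `GatedAnswer`, `ABSTAIN_MESSAGE`, and `keyword_overlap_judge` are hypothetical and do not come from any library. The toy judge is only a stand-in so the example runs; in practice it would be replaced by the separate LLM sufficiency call or the hidden-state classifier.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical fixed response returned whenever the gate decides to abstain.
ABSTAIN_MESSAGE = "I don't know: the provided context is insufficient."


@dataclass
class GatedAnswer:
    abstained: bool
    text: str


def answer_with_abstention(
    question: str,
    context: str,
    judge: Callable[[str, str], bool],
    generate: Callable[[str, str], str],
) -> GatedAnswer:
    """Run the sufficiency judge first; call the generator only if it passes."""
    if not judge(question, context):
        # The generator is never invoked, so it cannot hallucinate an answer.
        return GatedAnswer(abstained=True, text=ABSTAIN_MESSAGE)
    return GatedAnswer(abstained=False, text=generate(question, context))


def keyword_overlap_judge(question: str, context: str) -> bool:
    """Toy stand-in judge (assumption): passes if any longer question word
    appears in the context. A real judge would be a separate LLM call or a
    classifier trained on hidden states."""
    terms = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    lowered = context.lower()
    return any(term in lowered for term in terms)
```

With this shape, the abstention decision is a hard branch in application code rather than a behavior the generator is asked to exhibit, so swapping in a stricter judge does not require touching the generation prompt.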
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:16:04.356350+00:00— report_created — created