Report #82434
[research] LLM answering obscure or unanswerable questions with confident hallucinations instead of refusing
Explicitly define the boundaries of the model's knowledge in the system prompt. During preference training (RLHF/DPO), heavily penalize or reject examples that answer known-unanswerable questions, and reward 'I don't know' in those cases.
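A minimal sketch of the system-prompt half of this fix, assuming a chat setup where the prompt is a plain string. The function name `build_system_prompt`, its parameters, and the prompt wording are hypothetical and not taken from this report; how well any given model obeys such a prompt is not verified here.

```python
def build_system_prompt(domain: str, in_scope: list[str],
                        refusal: str = "I don't know.") -> str:
    """Return a system prompt that lists the in-scope topics and the refusal text.

    Hypothetical helper: the wording is illustrative, not a tested template.
    """
    # One bullet per topic the model is allowed to answer about.
    topics = "\n".join(f"- {topic}" for topic in in_scope)
    return (
        f"You are an assistant for {domain}.\n"
        f"Answer questions only about these topics:\n{topics}\n"
        "If a question falls outside these topics, or you are not sure "
        f"of the answer, reply exactly: {refusal}\n"
        "Do not guess."
    )


if __name__ == "__main__":
    print(build_system_prompt("Acme billing support", ["invoices", "refunds"]))
```

Fixing the refusal to one exact string is a design choice that makes refusals easy to detect later, for example when filtering training data.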
Journey Context:
Base models tend to learn to say 'I don't know' when training data is sparse, but RLHF trains this behavior out: human annotators rate attempted answers as more helpful than refusals, so refusals are heavily penalized. This 'IDK aversion' pushes the model to generate an answer even when it has no basis for one. To fix this, either train the reward model specifically on unanswerable queries so that it rewards refusals, or use the system prompt to restrict the model to a narrow, strictly defined domain of expertise.
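The training-side fix can be sketched as a data-preparation step: for each question already labeled unanswerable, pair the refusal (preferred) against the model's confident answer (dispreferred). This is an illustration under assumptions, not the reporter's pipeline. The `Example` class, the `to_preference_pair` helper, and the `REFUSAL` string are hypothetical; the `prompt` / `chosen` / `rejected` field names follow a common convention for preference datasets and may need renaming for a specific trainer.

```python
from dataclasses import dataclass
from typing import Optional

REFUSAL = "I don't know."  # assumed canonical refusal text


@dataclass
class Example:
    question: str
    model_answer: str   # what the current model produced
    answerable: bool    # label supplied by annotators or a curated list


def to_preference_pair(ex: Example) -> Optional[dict]:
    """Build one preference pair that prefers the refusal over a confident answer.

    Returns None when the example gives no useful contrast.
    """
    if ex.answerable:
        # This sketch only covers unanswerable questions.
        return None
    if ex.model_answer.strip() == REFUSAL:
        # The model already refused, so there is nothing to reject.
        return None
    return {
        "prompt": ex.question,
        "chosen": REFUSAL,             # rewarded response
        "rejected": ex.model_answer,   # penalized hallucination
    }


if __name__ == "__main__":
    ex = Example("Who won the 2090 World Cup?", "Brazil won 3-1.", answerable=False)
    print(to_preference_pair(ex))
```

The same pairs could also serve as reward-model training data, since they encode the judgment that a refusal beats an attempted answer on these queries.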
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:57:27.982654+00:00 - report_created - created