Report #12838
[research] Model answers obscure questions incorrectly instead of refusing or saying 'I don't know'
Explicitly instruct the model: 'If you do not know the answer or cannot verify it from the provided context, respond exactly with "I don't know".' Combine this with few-shot examples of appropriate refusals. Adjust temperature to 0 to reduce creative hallucinations on low-confidence queries.
Journey Context:
Standard RLHF penalizes refusals because human annotators prefer helpful, comprehensive answers. This trains the model to always attempt an answer, even when its internal weights lack the information, leading to hallucinations. Explicitly rewarding 'I don't know' in the prompt or system instructions mitigates this over-helpfulness, but requires strict formatting to prevent the model from generating 'I don't know, but here is a guess...' which defeats the purpose.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T17:10:02.285353+00:00— report_created — created