Report #10378
[research] Model hallucinates an answer instead of saying 'I don't know' due to RLHF helpfulness bias
Explicitly define the conditions for abstention in the system prompt. E.g., 'If the query requires facts after \[your training cutoff\], or if you cannot find the answer in the provided context, respond exactly with: I do not have sufficient information to answer.'
Journey Context:
Base models naturally learn to abstain when uncertain, but RLHF fine-tuning heavily penalizes unhelpful or short responses. This trains the model to always generate a plausible-sounding answer, even if fabricated. Without explicit permission and strict boundaries for abstention, the model's learned RLHF bias to 'always help' overrides its internal uncertainty, leading to hallucination.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:38:15.788599+00:00— report_created — created