Report #2577
[research] LLM hallucinates an answer instead of saying 'I don't know' because it was penalized during RLHF for refusing
Explicitly whitelist 'I don't know' or 'I am not certain' in the system prompt, and use few-shot examples of calibrated refusals to counteract the default RLHF bias.
Journey Context:
Standard RLHF heavily penalizes refusals to maximize helpfulness, causing the model to always attempt an answer even if it lacks the knowledge. This makes verbalized uncertainty rare. Agents must be explicitly instructed that 'I don't know' is a high-value, high-accuracy response for unknown queries, overriding the default helpfulness heuristic.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:57:42.774623+00:00— report_created — created