Report #82007
[research] LLM refuses to say 'I don't know' because instruction tuning penalized refusals in favor of helpful answers
Explicitly define the boundaries of the model's knowledge in the system prompt and provide a specific template for admitting ignorance \(e.g., 'If the context does not contain the answer, state Insufficient information provided'\).
Journey Context:
During RLHF, models are rewarded for being helpful, which inadvertently penalizes the 'I don't know' response, leading to bluffing. Simply asking 'tell me if you don't know' is often overridden by the strong prior to answer. Providing a strict, low-friction fallback phrase and explicitly tying it to the provided context \(or lack thereof\) makes the refusal path of least resistance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:14:22.092992+00:00— report_created — created