Report #3606
[research] Failure to express calibrated uncertainty or to refuse when knowledge is absent
Explicitly instruct the model to output a specific token or phrase (e.g., 'I don't know') when it cannot find the answer in the provided context, and state in the system prompt that a hallucinated answer is far worse than a refusal.
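A minimal sketch of the instruction described above, assuming a plain-text system prompt and a fixed refusal phrase. The names REFUSAL_SENTINEL, build_system_prompt, and is_refusal are hypothetical and not part of any library; the prompt wording is illustrative and has not been evaluated against any model.

```python
# Hypothetical sentinel: any fixed phrase works, as long as the prompt
# and the checker below agree on it.
REFUSAL_SENTINEL = "I don't know"


def build_system_prompt(sentinel: str = REFUSAL_SENTINEL) -> str:
    """Return a system prompt that permits refusal and discourages guessing."""
    return (
        "Answer using only the provided context.\n"
        f"If the context does not contain the answer, reply with exactly: {sentinel}\n"
        "An invented or unsupported answer is far worse than replying with that phrase."
    )


def is_refusal(reply: str, sentinel: str = REFUSAL_SENTINEL) -> bool:
    """True when the reply is the sentinel, ignoring case, surrounding
    whitespace, and trailing '.' or '!' characters."""
    normalized = reply.strip().rstrip(".!").strip().casefold()
    return normalized == sentinel.casefold()
```

Matching on an exact phrase keeps the downstream check trivial: the caller can branch on is_refusal(reply) instead of trying to detect hedged language in free text.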
Journey Context:
Standard RLHF tends to suppress 'I don't know' because annotators rate it as unhelpful. To counteract this, the system prompt must redefine helpfulness as accuracy, explicitly permit refusal, and ideally provide a structured output format for uncertainty.
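One way the structured output format could look, sketched under assumptions: the model is asked to reply with a JSON object carrying a status, an answer, and a confidence value. The field names, the two status values, and the names SYSTEM_PROMPT, ModelReply, and parse_reply are all hypothetical choices for illustration, not an established schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical prompt text: redefines helpfulness as accuracy and
# permits refusal through a dedicated status value.
SYSTEM_PROMPT = (
    "Being helpful means being accurate. Refusing is allowed.\n"
    "Reply with one JSON object with these keys:\n"
    '  "status": "answered" or "unknown"\n'
    '  "answer": a string, or null when status is "unknown"\n'
    '  "confidence": a number between 0 and 1'
)

ALLOWED_STATUSES = {"answered", "unknown"}


@dataclass
class ModelReply:
    status: str
    answer: Optional[str]
    confidence: float


def parse_reply(raw: str) -> ModelReply:
    """Validate a raw model reply against the assumed schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    status = data.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unexpected status: {status!r}")
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    answer = data.get("answer")
    if status == "unknown":
        answer = None  # discard any text attached to a refusal
    elif not isinstance(answer, str) or not answer:
        raise ValueError("an 'answered' reply needs a non-empty answer")
    return ModelReply(status=status, answer=answer, confidence=confidence)
```

Giving refusal its own status value makes "unknown" a valid, well-formed output rather than a failure case, which is the point of redefining helpfulness as accuracy. The self-reported confidence is only as reliable as the model's calibration and should be checked before it is used for anything.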
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:38:17.969308+00:00— report_created — created