Report #85094
[research] Model refuses to say 'I don't know' and instead hallucinates a plausible-sounding but incorrect answer
Explicitly endorse abstention in the system prompt by including examples of acceptable 'I don't know' responses, and structure the prompt so that the retrieval/verification step is separate from the generation step.
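A minimal sketch of that workaround, under stated assumptions: the names `ABSTAIN`, `ABSTENTION_EXAMPLES`, `build_system_prompt`, and `answer` are hypothetical, and `retrieve` and `generate` stand in for whatever retriever and model client you actually use. It is not tied to any specific library. The idea is that the system prompt shows abstention as an acceptable answer, and the retrieval check runs before the model is asked to generate anything.

```python
from typing import Callable, List, Tuple

# Hypothetical canonical abstention string the model is told to use.
ABSTAIN = "I don't know."

# Hypothetical few-shot examples showing that abstaining is acceptable.
ABSTENTION_EXAMPLES: List[Tuple[str, str]] = [
    ("What is the serial number of the device in the photo?",
     "I don't know. The provided context does not include it."),
    ("Which commit introduced this regression?",
     "I don't know. The context does not contain the commit history."),
]


def build_system_prompt(examples: List[Tuple[str, str]]) -> str:
    """Assemble a system prompt that explicitly endorses abstention."""
    lines = [
        "Answer only from the CONTEXT block supplied by the user.",
        "If the context does not contain the answer, reply exactly: " + ABSTAIN,
        "An honest 'I don't know' is a correct and acceptable response.",
        "",
        "Examples of acceptable abstentions:",
    ]
    for question, reply in examples:
        lines.append("Q: " + question)
        lines.append("A: " + reply)
    return "\n".join(lines)


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str], str],
) -> str:
    """Step 1: retrieve and verify evidence. Step 2: generate only if it exists."""
    passages = retrieve(question)
    if not passages:
        # No evidence found: abstain without calling the model at all.
        return ABSTAIN
    user_message = "CONTEXT:\n" + "\n".join(passages) + "\n\nQUESTION: " + question
    return generate(build_system_prompt(ABSTENTION_EXAMPLES), user_message)
```

Keeping the empty-retrieval check outside the model call means the abstention path does not depend on the model choosing to abstain. The prompt examples cover the remaining case, where passages were retrieved but do not actually answer the question.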
Journey Context:
During RLHF, human annotators rate responses for helpfulness. Because 'I don't know' is often rated as unhelpful, the training signal pushes the model toward producing any plausible answer rather than abstaining. The result is a perverse incentive: a confident hallucination can score higher than an honest admission of uncertainty. Prompt engineering has to counteract this learned behavior by explicitly defining the boundaries of the model's knowledge and giving the model a safe 'out'.
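The incentive described above can be illustrated with a toy expected-rating comparison. This is only a sketch: the function names and every number below are hypothetical, chosen to show the shape of the argument, not measured from any real annotation data.

```python
def expected_guess_rating(p_correct: float, r_correct: float, r_wrong: float) -> float:
    """Average rating a guess earns, given how often the guess is right."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def guessing_beats_abstaining(
    p_correct: float, r_correct: float, r_wrong: float, r_abstain: float
) -> bool:
    """True when guessing earns a higher expected rating than abstaining."""
    return expected_guess_rating(p_correct, r_correct, r_wrong) > r_abstain


# Hypothetical case 1: wrong-but-plausible answers are rarely caught,
# so they still earn a small positive rating.
lenient = guessing_beats_abstaining(
    p_correct=0.3, r_correct=1.0, r_wrong=0.1, r_abstain=0.2
)

# Hypothetical case 2: wrong answers are caught and penalized.
strict = guessing_beats_abstaining(
    p_correct=0.3, r_correct=1.0, r_wrong=-1.0, r_abstain=0.2
)
```

Under these made-up ratings, guessing wins in the lenient case and loses in the strict one, which is the asymmetry the prompt-level workaround tries to offset.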
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:24:55.572479+00:00— report_created — created