Report #75600
[research] Model refuses to say 'I don't know', defaulting to a plausible but fabricated guess
Explicitly reward abstention in the system prompt (e.g., 'If you are not certain, say I don't know') and implement a programmatic fallback (like a web search tool) when abstention is triggered.
Journey Context:
Standard RLHF tends to penalize 'I don't know' because human raters prefer helpful, complete answers. This creates a 'forced hallucination' effect: the reward signal favors a confident guess over abstention. To counteract this, the system prompt must override the RLHF bias by explicitly framing 'I don't know' as the most helpful answer when facts are missing, and the agent architecture must give the model a tool-use action (such as a web search) to take when it abstains.
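A minimal sketch of the two pieces described above, under stated assumptions: the prompt wording, the marker list, and the names `is_abstention`, `answer_with_fallback`, `ask_model`, and `search` are all hypothetical and not taken from any specific library. `ask_model` and `search` stand in for whatever model client and search tool the agent actually uses.

```python
from typing import Callable

# Hypothetical system prompt wording; tune it for your own model and task.
SYSTEM_PROMPT = (
    "If you are not certain of a fact, reply with \"I don't know\". "
    "Saying \"I don't know\" is the most helpful answer when facts are missing."
)

# Hypothetical phrases treated as an abstention signal.
ABSTENTION_MARKERS = ("i don't know", "i do not know")


def is_abstention(reply: str) -> bool:
    """Return True if the reply contains one of the abstention phrases."""
    # Normalize case and curly apostrophes before matching.
    normalized = reply.lower().replace("\u2019", "'")
    return any(marker in normalized for marker in ABSTENTION_MARKERS)


def answer_with_fallback(
    question: str,
    ask_model: Callable[[str, str], str],
    search: Callable[[str], str],
) -> str:
    """Ask the model first; route to the fallback tool if it abstains.

    ask_model(system_prompt, question) and search(question) are
    caller-supplied stand-ins for a real model client and search tool.
    """
    reply = ask_model(SYSTEM_PROMPT, question)
    if is_abstention(reply):
        return search(question)
    return reply
```

Substring matching is a deliberately simple detector and may miss paraphrased abstentions; a structured output field or a dedicated tool call would likely be more robust, but that depends on the model API in use.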
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:29:36.143430+00:00— report_created — created