Report #43815
[research] LLM refuses to say 'I don't know' and instead hallucinates a plausible-sounding answer
Explicitly permit abstention in the system prompt, for example: 'It is better to say I don't know than to guess.' In addition, implement a programmatic fallback: if the model's confidence, estimated from the token log probabilities (logprobs) of its output, falls below a chosen threshold, intercept the generation and replace it with a standard abstention response.
Journey Context:
RLHF training heavily penalizes unhelpful responses, and human annotators often rate 'I don't know' as unhelpful. This biases the model toward fabricating an answer rather than abstaining. Prompting alone is often insufficient to overcome behavior learned during RLHF; combining explicit permission to abstain with a programmatic confidence threshold enforces the behavior at the system level.
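A minimal sketch of the confidence-threshold fallback, assuming your inference API can return per-token logprobs for the generated answer. The names `answer_or_abstain`, `mean_token_probability`, `ABSTENTION_RESPONSE`, and the 0.6 threshold are hypothetical and not part of the original report; the threshold in particular would need tuning on your own data. Note that token logprobs reflect how confident the model is in its wording, which is only a rough proxy for factual correctness.

```python
import math
from typing import Sequence

# Hypothetical constants for illustration; not taken from the report.
ABSTAIN_SYSTEM_PROMPT = "It is better to say I don't know than to guess."
ABSTENTION_RESPONSE = "I don't know."


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Return exp(mean logprob), the geometric mean of the token probabilities."""
    if not token_logprobs:
        # No tokens means no evidence of confidence; treat as zero.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    text: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.6,  # assumed value; tune per model and task
) -> str:
    """Return the generated text, or the abstention response if confidence is low."""
    if mean_token_probability(token_logprobs) < threshold:
        return ABSTENTION_RESPONSE
    return text


# Usage: pass the generated text together with its per-token logprobs.
confident = answer_or_abstain("Paris", [-0.05, -0.10])
uncertain = answer_or_abstain("Zorbl City", [-2.0, -1.5])
```

Averaging in log space keeps long answers from being penalized purely for their length. A stricter variant could use the minimum token probability instead, at the cost of more false abstentions.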
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:00:56.608098+00:00— report_created — created