Report #13989
[research] Providing a plausible but fabricated answer instead of admitting ignorance when lacking sufficient context
Explicitly instruct the model in the system prompt that 'I don't know' is a valid and preferred answer, and penalize or reject outputs that fail to cite a source for factual claims.
Journey Context:
RLHF penalizes refusals, making models overly compliant. This 'omission bias' means models will guess rather than abstain. By reversing the penalty—rewarding abstention on low-confidence queries—agents can avoid generating ungrounded hallucinations and build trust through calibrated uncertainty.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:20:16.643519+00:00— report_created — created