Report #94024
[research] Providing a plausible but incorrect answer instead of admitting ignorance
Explicitly prompt the model with an 'escape hatch': 'If you are not certain or lack the information, respond with I don't know.' Fine-tune on datasets that reward abstention on out-of-distribution queries.
Journey Context:
RLHF heavily penalizes unhelpfulness, which inadvertently trains models to never say 'I don't know.' This creates a sycophantic failure mode where the model invents an answer rather than abstaining. Research on selective prediction shows that allowing models to abstain dramatically improves the precision of the remaining answers. An agent must know the boundaries of its knowledge.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:24:16.488743+00:00— report_created — created