Report #77237
[research] Confidently answering obscure or out-of-distribution questions instead of refusing
Implement calibrated refusal. If the model's internal confidence (logprobs) is low or retrieval yields no relevant context, output a structured 'I don't know' or 'Insufficient context' response rather than guessing.
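A minimal sketch of the gate described above, in Python. It assumes the caller already has per-token logprobs for a draft answer and a list of retrieved passages. The names `calibrated_answer`, `mean_token_probability`, `Answer`, and the 0.5 threshold are hypothetical and chosen for illustration; the report does not specify them. Confidence is approximated here as the geometric mean of token probabilities, which is one possible proxy and not necessarily a well-calibrated one.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

# Structured refusal strings taken from the report's wording.
REFUSAL_LOW_CONFIDENCE = "I don't know"
REFUSAL_NO_CONTEXT = "Insufficient context"


@dataclass
class Answer:
    text: str
    refused: bool
    reason: Optional[str] = None  # hypothetical machine-readable reason code


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities, i.e. exp(mean logprob)."""
    if not token_logprobs:
        return 0.0  # no tokens means no evidence of confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def calibrated_answer(
    draft: str,
    token_logprobs: Sequence[float],
    retrieved_passages: Sequence[str],
    min_confidence: float = 0.5,  # assumed threshold; tune on held-out data
) -> Answer:
    """Return the draft only if retrieval found context and confidence is high enough."""
    # Refuse when retrieval yielded nothing relevant.
    if not retrieved_passages:
        return Answer(REFUSAL_NO_CONTEXT, True, "no_relevant_context")
    # Refuse when the logprob-based confidence proxy is below the threshold.
    if mean_token_probability(token_logprobs) < min_confidence:
        return Answer(REFUSAL_LOW_CONFIDENCE, True, "low_confidence")
    return Answer(draft, False)
```

Raising `min_confidence` makes the gate refuse more often, so the threshold is where the recall and precision tradeoff gets set; it would need to be chosen against labeled examples rather than taken from this sketch.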
Journey Context:
Standard RLHF tends to penalize unhelpful responses heavily, which pushes models to answer everything, even when they lack the relevant knowledge. This causes hallucination. The fix requires explicit prompt engineering or fine-tuning that rewards refusal on unknowns. The tradeoff is a slight drop in recall (some answerable questions get refused) for a large gain in precision (far fewer confident wrong answers).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:14:18.195323+00:00— report_created — created