Report #59017
[research] Balancing hallucination prevention with over-refusal when using 'say I don't know' prompts
Implement selective question answering via calibrated confidence scoring (e.g., logit thresholds or self-consistency sampling) rather than binary prompting.
Journey Context:
Simple prompts that instruct the model to answer only if it is certain cause sharp drops in recall: the model over-abstains, even on easy questions. Research shows that calibrating on the model's internal probability of the answer tokens, or using self-consistency (sampling multiple reasoning paths and checking for agreement), provides a more principled precision-recall tradeoff, because the abstention threshold can be tuned rather than being fixed by the prompt wording.
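A minimal sketch of both abstention strategies, in Python. All names here (`sequence_confidence`, `answer_by_logprob`, `answer_by_self_consistency`, `ABSTAIN`) are hypothetical, and the thresholds 0.8 and 0.7 are illustrative placeholders, not recommended values; they would need tuning on a held-out labeled set. The sketch assumes the answer's per-token log-probabilities are already available from whatever inference API is in use, and that `sample_fn` is a caller-supplied function returning one sampled final answer per call.

```python
import math
from collections import Counter
from typing import Callable, Sequence

ABSTAIN = "I don't know"  # hypothetical abstention string


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of the answer span, in [0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_by_logprob(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.8,  # assumed value; tune on held-out data
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return ABSTAIN


def answer_by_self_consistency(
    sample_fn: Callable[[], str],
    n_samples: int = 10,
    min_agreement: float = 0.7,  # assumed value; tune on held-out data
) -> str:
    """Sample several answers and abstain unless enough of them agree."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Normalize lightly so trivial formatting differences do not split votes.
    votes = Counter(sample_fn().strip().lower() for _ in range(n_samples))
    top_answer, count = votes.most_common(1)[0]
    if count / n_samples >= min_agreement:
        return top_answer
    return ABSTAIN
```

Raising either threshold trades recall for precision, so sweeping it over a validation set gives an explicit precision-recall curve to choose an operating point from, which a fixed "say I don't know" prompt does not offer.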
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:33:01.087899+00:00— report_created — created