Report #58857
[research] Model is overly conservative, refusing to answer questions it actually knows the answer to (false negatives), often due to aggressive safety or anti-hallucination tuning
Calibrate the 'I don't know' threshold by using a small validation set to find the optimal logprob threshold for refusal, rather than relying on zero-shot prompting like 'Only answer if you are absolutely sure.'
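A minimal sketch of the calibration step, under stated assumptions: each validation example has already been reduced to one confidence score (for instance a logprob-derived value) and a correctness label. The function name `calibrate_threshold` and the `max_error_rate` parameter are hypothetical, and the selection rule (maximize coverage subject to an error budget) is one reasonable choice, not something the report prescribes.

```python
from typing import List, Tuple


def calibrate_threshold(
    scores: List[float],
    correct: List[bool],
    max_error_rate: float = 0.1,
) -> Tuple[float, float]:
    """Pick the lowest score threshold whose answered set stays within the error budget.

    The model answers when score >= threshold and refuses otherwise.
    Returns (threshold, coverage), where coverage is the fraction answered.
    If no threshold satisfies the budget, returns (inf, 0.0), i.e. always refuse.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")

    # Walk examples from most to least confident, tracking cumulative errors.
    pairs = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    best_threshold = float("inf")
    best_coverage = 0.0
    errors = 0

    for i, (score, ok) in enumerate(pairs, start=1):
        if not ok:
            errors += 1
        # Tied scores fall on the same side of any threshold,
        # so only evaluate once the whole tie group has been counted.
        if i < n and pairs[i][0] == score:
            continue
        if errors / i <= max_error_rate:
            best_threshold = score
            best_coverage = i / n

    return best_threshold, best_coverage


# Hypothetical validation data: mean token logprob per answer, plus a graded label.
val_scores = [-0.05, -0.10, -0.30, -0.60, -1.20, -2.50]
val_correct = [True, True, True, False, True, False]

threshold, coverage = calibrate_threshold(val_scores, val_correct, max_error_rate=0.25)
print(f"answer when score >= {threshold}, coverage = {coverage:.2f}")
```

Tightening `max_error_rate` raises the threshold and lowers coverage, so the budget should come from what the target domain can tolerate, and the chosen threshold should be rechecked on held-out data before use.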
Journey Context:
There is a fundamental tradeoff between hallucination (false positives) and helpfulness or coverage (false negatives). Aggressive prompting ('say I don't know if unsure') pushes the model too far toward refusal and destroys recall. Finding the right operating point requires measuring the model's internal uncertainty (logprobs) against a domain-specific dataset, rather than relying on arbitrary verbal thresholds written into the prompt.
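To illustrate what "measuring internal uncertainty" can look like in practice, the sketch below turns per-token logprobs into a single confidence score and applies a numeric cutoff. The report does not say how logprobs should be aggregated; the length-normalized mean used here is an assumption, and `sequence_confidence` and `answer_or_abstain` are hypothetical helper names. The token logprobs are assumed to come from whatever inference API is in use.

```python
from typing import List, Optional


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Mean token log-probability of a generated answer.

    Averaging (rather than summing) keeps long answers from being
    penalized purely for their length. This is an assumed aggregation.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return sum(token_logprobs) / len(token_logprobs)


def answer_or_abstain(
    answer: str,
    token_logprobs: List[float],
    threshold: float,
) -> Optional[str]:
    """Return the answer if its confidence clears the threshold, else None (refuse)."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return None


# Hypothetical values: a confident answer and an uncertain one, same threshold.
print(answer_or_abstain("Paris", [-0.1, -0.3], threshold=-0.5))
print(answer_or_abstain("Lyon", [-1.0, -2.0], threshold=-0.5))
```

The threshold here is a plain number that can be moved along the hallucination/coverage tradeoff and validated against domain data, which a phrase like "only if absolutely sure" cannot be.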
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:16:55.800568+00:00— report_created — created