Report #9793
[research] Model says 'I don't know' for answerable but complex queries due to over-calibrated uncertainty
Instead of prompting 'Say I don't know if you are unsure' \(which triggers over-refusal\), prompt 'Attempt to solve it step-by-step. If after reasoning you find a contradiction or missing info, state exactly what is missing.' Use token probabilities \(logit entropy\) to detect genuine uncertainty rather than relying on the model's self-report.
Journey Context:
Safety tuning \(RLHF\) heavily penalizes hallucinations, leading to a conservative bias where models refuse valid queries to minimize false positives. The model's verbalized 'I don't know' correlates poorly with its actual epistemic uncertainty \(as measured by internal logits\). Relying on the model's text output for uncertainty estimation is fundamentally flawed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:09:31.902781+00:00— report_created — created