Report #21411
[research] Model flails and hallucinates answers instead of expressing calibrated uncertainty or saying 'I don't know'
Use selective prediction: abstain instead of forcing an answer when the estimated probability of correctness falls below a chosen cutoff. The estimate can come from token logits (log-probabilities) or from a secondary verification model.
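A minimal sketch of the logit-threshold variant, assuming the serving stack exposes per-token log-probabilities for the generated answer. The function names, the geometric-mean score, and the 0.75 default cutoff are illustrative assumptions, not part of this report; the score is only a proxy for correctness and is not calibrated by itself.

```python
import math
from typing import Optional, Sequence


def sequence_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities (exp of the mean log-prob).

    Assumption: this length-normalized score stands in for the
    probability that the whole answer is correct.
    """
    if not token_logprobs:
        return 0.0  # no evidence at all, treat as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.75,  # hypothetical cutoff, tune on held-out data
) -> Optional[str]:
    """Return the answer if its confidence clears the threshold, else None."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return None  # caller renders this as "I don't know"


if __name__ == "__main__":
    confident = [math.log(0.9)] * 3
    shaky = [math.log(0.3), math.log(0.5)]
    print(answer_or_abstain("Paris", confident))
    print(answer_or_abstain("Lyon", shaky))
```

Returning None rather than a fixed refusal string keeps the abstention decision separate from how it is worded to the user.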
Journey Context:
Naively prompting 'say I don't know if you aren't sure' causes over-refusal on hard but solvable problems. Calibration requires an actual confidence measurement, taken from token probabilities or from a verification step, so that the model expresses uncertainty only in domains where its weights lack sufficient signal, not whenever a problem merely looks hard.
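One way to keep abstention from turning into over-refusal is to choose the cutoff from labeled held-out data instead of guessing it. The sketch below assumes a list of (confidence, was_correct) pairs is available from such a set; `pick_threshold` and `target_accuracy` are hypothetical names introduced here for illustration.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    scored: List[Tuple[float, bool]],
    target_accuracy: float,
) -> Optional[float]:
    """Lowest confidence cutoff whose answered subset meets the target accuracy.

    scored: (confidence, was_correct) pairs from a labeled held-out set.
    Answers with confidence >= the returned cutoff are kept; the rest abstain.
    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += ok
        # Evaluate only after the last item of a run of tied confidences,
        # since a cutoff cannot separate examples with equal scores.
        last_of_tie = i == len(ranked) or ranked[i][0] != conf
        if last_of_tie and correct / i >= target_accuracy:
            best = conf  # a lower cutoff answers more questions
    return best


if __name__ == "__main__":
    held_out = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.4, False)]
    print(pick_threshold(held_out, target_accuracy=0.75))
```

Picking the lowest qualifying cutoff maximizes the share of questions answered at the required accuracy, which is the trade-off a blanket "say I don't know" prompt cannot control.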
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:20:48.257189+00:00— report_created — created