Report #22949
[research] Refusing to answer questions the model can actually answer (over-refusal), or giving confident hallucinated answers to obscure questions, due to poorly calibrated RLHF
Use chain-of-thought reasoning in which the model must first state what it knows and what it does not know, then assign itself an explicit confidence level. Map each confidence level to an action: High -> answer; Medium -> answer with caveats; Low -> use search or a tool, or say 'I don't know'.
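A minimal sketch of the confidence-to-action mapping follows, assuming the model is prompted to end its self-assessment with a line of the form `CONFIDENCE: High|Medium|Low`. The prompt wording, the `Action` enum, and the `action_for` helper are hypothetical names introduced for illustration; the report does not prescribe them.

```python
import re
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEATS = "answer_with_caveats"
    SEARCH_OR_ABSTAIN = "search_or_abstain"


# Hypothetical prompt wording; the report does not give exact text.
ASSESSMENT_PROMPT = (
    "Before answering, reason step by step.\n"
    "KNOWN: list what you know that is relevant.\n"
    "UNKNOWN: list what you do not know or are unsure about.\n"
    "CONFIDENCE: one of High, Medium, Low.\n\n"
    "Question: {question}"
)

# Mapping from the report: High -> answer, Medium -> answer with caveats,
# Low -> search/tool or "I don't know".
CONFIDENCE_TO_ACTION = {
    "high": Action.ANSWER,
    "medium": Action.ANSWER_WITH_CAVEATS,
    "low": Action.SEARCH_OR_ABSTAIN,
}

_CONFIDENCE_RE = re.compile(
    r"^\s*CONFIDENCE:\s*(high|medium|low)\b",
    re.IGNORECASE | re.MULTILINE,
)


def action_for(assessment: str) -> Action:
    """Pick an action from the model's self-assessment text.

    Falls back to the most cautious action when no confidence line is
    found (an assumption of this sketch, not part of the report).
    """
    match = _CONFIDENCE_RE.search(assessment)
    if match is None:
        return Action.SEARCH_OR_ABSTAIN
    return CONFIDENCE_TO_ACTION[match.group(1).lower()]
```

Treating an unparseable assessment as Low is a design choice of this sketch: when the self-assessment is malformed, searching or abstaining seems safer than answering.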
Journey Context:
Standard RLHF can create a 'know-it-all' bias: models are penalized for saying 'I don't know', which leads to confident hallucinations. Conversely, safety-tuned models often over-refuse queries that are safe but obscure. Two changes together are reported to significantly improve calibration: decoupling the decision to answer from the generation of the answer, and forcing the model to articulate its uncertainty in a chain of thought (CoT) before answering.
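One way to picture the decoupling is a two-stage call: the first stage asks only for a self-assessment, and the second stage generates an answer only if the first stage allows it. The sketch below assumes a hypothetical `model` callable (prompt text in, completion text out) and an optional hypothetical `search` callable; neither is a real library API, and the prompt strings are illustrative.

```python
from typing import Callable, Optional

ModelFn = Callable[[str], str]   # hypothetical: prompt in, completion out
SearchFn = Callable[[str], str]  # hypothetical: query in, context text out


def assess(model: ModelFn, question: str) -> str:
    """Stage 1: request a self-assessment only, not the answer."""
    prompt = (
        "Do not answer yet. State what you know and what you do not know "
        "about the question, then end with a line "
        "'CONFIDENCE: High', 'CONFIDENCE: Medium' or 'CONFIDENCE: Low'.\n"
        f"Question: {question}"
    )
    return model(prompt)


def parse_confidence(assessment: str) -> str:
    """Return 'high', 'medium' or 'low'; default to 'low' if unparseable."""
    for line in reversed(assessment.splitlines()):
        if line.strip().lower().startswith("confidence:"):
            level = line.split(":", 1)[1].strip(" .").lower()
            if level in {"high", "medium", "low"}:
                return level
    return "low"


def respond(model: ModelFn, question: str,
            search: Optional[SearchFn] = None) -> str:
    """Stage 2: decide first, then generate only if the decision allows."""
    level = parse_confidence(assess(model, question))
    if level == "low":
        if search is None:
            return "I don't know."
        context = search(question)
        return model(
            f"Answer using only this context:\n{context}\n"
            f"Question: {question}"
        )
    answer = model(f"Answer the question.\nQuestion: {question}")
    if level == "medium":
        return f"{answer}\n(Moderate confidence; please verify.)"
    return answer
```

Because the decision is made before any answer text exists, a fluent but unsupported draft cannot influence whether the model answers.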
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:56:00.533328+00:00— report_created — created