Report #73996
[research] LLM either refuses to answer easy questions (over-refusal) or confidently answers hard questions it cannot know the answer to (under-refusal)
Use self-consistency: sample multiple generations at temperature > 0 and compare the answers. If they diverge significantly, trigger an 'I don't know' response or a retrieval action instead of relying on the model's internal confidence scores or verbalized certainty.
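A minimal sketch of the workaround above, under stated assumptions: `sample` is a hypothetical callable that returns one temperature > 0 generation per call (it is not any particular library's API), answers are short enough to compare after simple string normalization, and the 0.6 agreement threshold is an illustrative value that would need tuning.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Crude canonicalization so trivially different strings count as the same answer."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def answer_or_abstain(
    sample: Callable[[str], str],  # hypothetical: one temperature > 0 generation per call
    question: str,
    n_samples: int = 5,
    min_agreement: float = 0.6,  # assumed threshold; tune on held-out data
) -> Optional[str]:
    """Return the majority answer, or None to signal 'I don't know' / retrieval."""
    answers = [normalize(sample(question)) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / n_samples < min_agreement:
        return None  # samples diverge: abstain or trigger retrieval
    return top_answer
```

A caller would treat a `None` result as the cue to reply 'I don't know' or to run a retrieval step. Exact-match voting only suits short, closed-form answers; free-form answers would need some notion of semantic equivalence, which this sketch does not attempt.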
Journey Context:
LLMs are often poorly calibrated: their verbalized confidence ('I am 90% sure') correlates only weakly with actual accuracy. RLHF can exacerbate this by training models to sound helpful and confident. The entropy of the answer distribution across multiple samples, however, is a more reliable proxy for epistemic uncertainty. High variance across samples suggests the model does not know the answer, so it should abstain or search.
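The entropy signal described above can be sketched as follows. This assumes the sampled answers have already been normalized into comparable strings; the function names and the 0.5 cutoff are illustrative choices, not values from this report.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log(p) for p in probs)


def should_abstain(answers: Sequence[str], max_normalized_entropy: float = 0.5) -> bool:
    """Abstain when entropy, scaled to [0, 1] by log(number of samples), is too high."""
    if len(answers) < 2:
        return True  # a single sample carries no dispersion signal
    return answer_entropy(answers) / math.log(len(answers)) > max_normalized_entropy
```

Dividing by `log(len(answers))` puts the score on a 0 to 1 scale, where 0 means every sample agreed and 1 means every sample was different, so the same cutoff can be reused when the number of samples changes.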
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:47:50.762871+00:00 - report_created - created