Report #91311
[research] LLM expresses high confidence on questions outside its knowledge boundary, rather than saying 'I don't know'
Use logprob-based calibration or a dedicated calibration classifier layer. Prompt the model to output a confidence score (0-100) *before* generating the answer, and set a hard threshold (e.g., < 80) to trigger an 'I don't know' response.
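A minimal sketch of the threshold gate described above, not a tested production implementation. It assumes the model was prompted to emit a line such as "Confidence: 85" before its answer; that output format, the helper names (parse_confidence, gate_answer, mean_token_probability), and the default threshold of 80 are illustrative assumptions. The logprob helper assumes you already have per-token natural-log probabilities from whatever API you use.

```python
import math
import re

ABSTAIN = "I don't know"


def parse_confidence(text):
    """Return the integer following 'Confidence:' capped at 100, or None if absent."""
    match = re.search(r"confidence\s*:\s*(\d{1,3})", text, re.IGNORECASE)
    if match is None:
        return None
    return min(int(match.group(1)), 100)


def gate_answer(confidence_text, answer, threshold=80):
    """Return the answer only if the stated confidence meets the threshold.

    A missing or unparseable score is treated as low confidence.
    """
    confidence = parse_confidence(confidence_text)
    if confidence is None or confidence < threshold:
        return ABSTAIN
    return answer


def mean_token_probability(token_logprobs):
    """Geometric-mean token probability from natural-log token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```

For example, gate_answer("Confidence: 40", "Paris") returns the abstention string, while gate_answer("Confidence: 92", "Paris") returns "Paris". The value from mean_token_probability could be compared against its own threshold as a second, model-internal signal; any threshold should be tuned on held-out questions rather than taken from this sketch.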
Journey Context:
LLMs are notoriously poorly calibrated out of the box: the confidence they express in language ('I am certain') does not correlate well with actual accuracy. Simply prompting 'say I don't know if you aren't sure' often leads to over-abstention on easy questions or under-abstention on hard ones. Eliciting a numerical confidence before the answer reduces post-hoc rationalization bias, where the model talks itself into a wrong answer and then claims high confidence.
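One way to check how far stated confidence drifts from accuracy is expected calibration error (ECE) over a labeled evaluation set. The sketch below is an illustration only: the function name, the choice of 10 equal-width bins, and the input format (confidences scaled to 0-1, plus a parallel list of correctness flags) are assumptions, not something this report specifies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], e.g. a 0-100 score divided by 100.
    correct: booleans, True where the corresponding answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that states 0.95 confidence on four questions and gets only one right has an ECE near 0.70 under this measure, while a well-calibrated model scores close to 0. Running this before and after adding the confidence gate gives a rough sense of whether a chosen threshold is meaningful.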
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:51:34.166147+00:00— report_created — created