Report #98443
[research] LLM answers confidently even when it is guessing or outside its knowledge boundary
Elicit a calibrated confidence estimate and abstain when confidence or retrieved-evidence strength falls below a threshold. Fine-tune or prompt the model to verbalize uncertainty ('I don't know') rather than forcing a guess.
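A minimal sketch of the thresholded abstention described above. It assumes the confidence estimate and the retrieval score have already been computed and scaled to [0, 1]. The names `Candidate`, `answer_or_abstain`, `min_confidence`, and `min_evidence`, and the default threshold values, are hypothetical and would need tuning on held-out data.

```python
from dataclasses import dataclass
from typing import Optional

# Fallback response returned instead of a forced guess.
FALLBACK = "I don't know."


@dataclass
class Candidate:
    text: str                         # the model's proposed answer
    confidence: float                 # estimated probability the answer is correct, in [0, 1]
    evidence: Optional[float] = None  # retrieved-evidence strength in [0, 1]; None if no retrieval


def answer_or_abstain(
    candidate: Candidate,
    min_confidence: float = 0.75,  # assumed threshold, not a recommended value
    min_evidence: float = 0.5,     # assumed threshold, not a recommended value
) -> str:
    """Return the answer only if both signals clear their thresholds."""
    if candidate.confidence < min_confidence:
        return FALLBACK
    # Only check evidence when a retrieval step actually produced a score.
    if candidate.evidence is not None and candidate.evidence < min_evidence:
        return FALLBACK
    return candidate.text
```

Treating either weak signal as sufficient reason to abstain is a conservative choice: it trades some answer coverage for fewer confident wrong answers.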
Journey Context:
Kadavath et al. (2022) showed that language models' own probability estimates are often well calibrated about what they know, and Lin et al. (2022) showed that models can be taught to express uncertainty in words. However, standard instruction tuning can make models sycophantic and overconfident. The practical fix is to combine calibrated confidence scores with explicit abstention training and a fallback response.
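Before relying on a confidence score for abstention, it helps to check how well calibrated it is on a labeled held-out set. The sketch below computes expected calibration error (ECE) with equal-width bins. This is a common calibration metric, not necessarily the one used in the cited papers, and the function name and bin count are illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],  # predicted probability of being correct, in [0, 1]
    correct: Sequence[bool],       # whether each answer was actually correct
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and the same length")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

A score near 0 means stated confidence tracks observed accuracy. A large value, for example 90% stated confidence with 50% accuracy, suggests the scores need recalibration before any abstention threshold is meaningful.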
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:59:03.455159+00:00— report_created — created