Report #100301
[research] Model does not know when to say 'I don't know'
Calibrate refusal behavior explicitly: train or prompt the model to abstain when retrieval returns low-confidence results or when the model's confidence in its answer falls below a tuned threshold. Prefer selective abstention over forced guessing.
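A minimal sketch of such an abstention gate, under assumptions not stated in this report: the names `Candidate`, `answer_or_abstain`, `min_retrieval`, and `min_confidence` are hypothetical, retrieval scores are assumed to be "higher is better", and the confidence is assumed to be a probability-like value in [0, 1].

```python
from dataclasses import dataclass

ABSTAIN_MESSAGE = "I don't know."


@dataclass
class Candidate:
    """One candidate answer plus the signals used to gate it (hypothetical shape)."""
    answer: str
    retrieval_score: float    # assumed: higher means a better retrieval match
    answer_confidence: float  # assumed: probability-like score in [0, 1]


def answer_or_abstain(c: Candidate, min_retrieval: float, min_confidence: float) -> str:
    """Return the answer only if both signals clear their thresholds; otherwise abstain."""
    if c.retrieval_score < min_retrieval or c.answer_confidence < min_confidence:
        return ABSTAIN_MESSAGE
    return c.answer
```

The two thresholds are placeholders here; they are meant to be tuned on held-out data rather than set by hand.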
Journey Context:
LLMs are optimized by default to be helpful, so they tend to guess rather than abstain. This is harmful when the cost of a wrong answer exceeds the cost of no answer. Kadavath et al. (2022) showed that model confidence correlates with correctness and can be used for selective answering; Lin, Hilton, and Evans (2022) demonstrated that models can be trained to express calibrated uncertainty. The common error is to add a generic 'say I don't know if unsure' prompt without setting a threshold or measuring the coverage/accuracy tradeoff. The right approach is to define an abstention threshold on retrieval score or model confidence, tune it on a held-out set, and report an accuracy-coverage curve.
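The tuning step can be sketched roughly as follows. This is an illustrative sketch, not a reference implementation: `accuracy_coverage_curve` and `pick_threshold` are hypothetical helper names, the model is assumed to answer whenever its confidence is at or above the threshold, and the held-out values in the usage lines are made up for illustration only.

```python
from typing import List, Optional, Sequence, Tuple

CurvePoint = Tuple[float, float, float]  # (threshold, coverage, accuracy)


def accuracy_coverage_curve(
    confidences: Sequence[float], correct: Sequence[bool]
) -> List[CurvePoint]:
    """Sweep each distinct confidence value as a threshold (answer if conf >= threshold).

    coverage = fraction of held-out items answered
    accuracy = fraction of answered items that are correct
    """
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    curve: List[CurvePoint] = []
    answered = 0
    right = 0
    i = 0
    while i < n:
        threshold = pairs[i][0]
        # Items that share the same confidence enter the answered set together.
        while i < n and pairs[i][0] == threshold:
            answered += 1
            right += int(pairs[i][1])
            i += 1
        curve.append((threshold, answered / n, right / answered))
    return curve


def pick_threshold(curve: List[CurvePoint], target_accuracy: float) -> Optional[CurvePoint]:
    """Return the highest-coverage point whose accuracy meets the target, or None."""
    best: Optional[CurvePoint] = None
    for point in curve:
        _, coverage, accuracy = point
        if accuracy >= target_accuracy and (best is None or coverage > best[1]):
            best = point
    return best


# Made-up held-out data, for illustration only.
held_out_conf = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
held_out_correct = [True, True, True, False, True, False]
curve = accuracy_coverage_curve(held_out_conf, held_out_correct)
chosen = pick_threshold(curve, target_accuracy=0.9)
```

Reporting the full `curve` rather than only `chosen` makes the tradeoff visible: lowering the threshold raises coverage, and the curve shows how much accuracy that costs.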
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:00:00.507090+00:00— report_created — created